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EXECUTIVE SUMMARY 


Steganography is a relatively new mode of communication that is far less developed and understood in comparison 
to other research fields that offer privacy and security, such as cryptography. In particular, despite its fundamental 
importance, the very basic question of how big a payload it is secure to embed in a given cover object to prevent an 
adversary from detecting the presence of a secret message, is still open. The first result in this direction appeared 
in [61, 88, 89]. Through a series of articles published in 2004-2008, it has been established that steganographic capacity 
of perfectly secure stegosystems grows linearly with the number of cover elements (pixels) — secure steganography has a 
positive rate. In practice, however, neither the adversary (Warden) nor the steganographer has perfect knowledge of the 
cover source and thus it is unlikely that perfectly secure stegosystems for empirical covers, such as digital media, will 
ever be constructed. This justifies the first topic on which the PI focused — the study of secure capacity of imperfect 
stegosystenis. 

In 2006 and 2007, theoretical results appeared concerning the paradigm of batch steganography (50, 51] (embedding by 
dividing payload among multiple covers to minimize statistical detectability). These results, supported by experiments 
with blind steganalyzers, pointed to an emerging paradigm: whether steganography is performed in a large batch of cover 
objects or in a single large object, the size of the secure payload grows only sub-linearly. In particular, the secure payload 
appeared to be proportional to the square root of the number of pixels in the cover image. This so-called Square-Root Law 
of Steganography is the first principal achievement reported here. In Section 1, it is formally established for imperfect 
stegosystems that hide messages in covers modeled as a stationary Markov chain and when the embedding changes are 
mutually independent. An important new part of this contribution, explained in Section (2), is a complete characterization 
of perfectly-secure cover sources with respect to a given embedding operation. The square root law has been confirmed 
experimentally on real imagery and several different embedding and steganalysis methods in Section 3. Among other 
contributions, it has been established that secure payload of imperfect steganographic systems is completely described 
using the so-called root rate that depends on the steganographic Fisher information, which forms a security descriptor 
equivalent to the Kullback—Leibler divergence (Section 2). The Fisher information can be used for optimizing the design 
of steganographic systems, for benchmarking, and comparison of covert communications schemes. 

Having established the fundamental limitations of covert communication in empirical (incognizable) covers, the PI then 
directed her effort towards a general framework within which steganographic schemes can be built in practice (Sections 5- 
6). By abandoning the concept of preserving the cover distribution (as the distribution is incognizable anyway), the 
steganography is instead casted as a source coding with fidelity constraint. The so-called minimal embedding impact 
steganography is formulated using the concept of distortion that is fundamentally tied to statistical detectability. The 
PI resolved the problem of embedding while minimizing an essentially arbitrary distortion function by formulating the 
embedding problem within statistical physics. The resulting “Gibbs construction” (Section 6) is an elegant framework 
within which one can study, design, and optimize steganographic schemes. Syndrome-trellis codes were proposed as a 
general approach to practical embedding. Finally, the PI managed to tie distortion to statistical detectability by converting 
the problem of minimizing detection to a parameter optimization problem [24]. Thus, the developed framework provides 
an enclosed and complete theoretical basis for further development of steganography in empirical covers. 

This report makes use of the Iverson bracket [J] defined to be 1 if the logical expression J is true and zero otherwise. 
The binary entropy function h(x) = —x log x — (1 — x) log(1 — x) is expressed in bits. 
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1. THE SQUARE RooT LAW OF SECURE STEGANOGRAPHIC PAYLOAD 


In steganography, the sender communicates with the receiver by hiding her messages inside innocuous looking (cover) 
objects. Most practical steganographic methods embed messages by slightly modifying individual elements of the cover, 
obtaining thus the modified stego object that conveys the hidden message. The goal here is to make the stego objects 
statistically indistinguishable from covers — a passive warden, who is merely inspecting the traffic, cannot construct a 
detector of stego objects that would work better than an algorithm that makes random guesses. The assumption is 
that, up to a secret shared key, the warden is familiar with all details of the steganographic scheme: this is the so-called 
Kerckhoffs’ principle, which is also interpreted to mean that the warden has complete knowledge of the probabilistic 
distribution of cover objects. 

Statistical detectability of embedding changes depends on their character and extent. Intuitively, it should be possible 
to send short messages with lower risk of being detected than long messages. From a practical point of view, the sender 
needs to know how long a message she can embed for a chosen risk -- she needs to know the steganographic capacity of 
the stegosystem. Unfortunately, determining the steganographic capacity analytically for real digital media objects, such 
as digital images, is very difficult even for the simplest steganographic paradigms, such as LSB (Least Significant Bit) 
embedding. The reason is the lack of accurate statistical models. 

One may intuitively expect the steganographic capacity to be linear in the size of the cover object by referring to a 
similar result for capacity of noisy communication channels. This is, indeed, valid if the stegosystem is perfectly secure, 
since there is no possible detector [6], 12]. In view of the absence of provably secure steganographic methods for real 
digital media, it makes sense to investigate steganographic capacity of imperfect embedding methods for which detectors 
exist and inquire about the largest payload that can be embedded using their e-secure versions in the sense of Cachin {9}. 

The fact that steganographic capacity is most likely sublinear was already suspected by Anderson [1] in 1996: 


“Thanks to the Central Limit Theorem, the more covertext one gives the Warden, the better he will be 
able to estimate its statistics, and so the smaller the rate at which [the steganographer] will be able to 
tweak bits safely. The rate might even tend to zero...” 


The analysis of batch steganography and pooled steganalysis by Ker [51] tells us that steganographic capacity of imperfect 
stegosystems only grows as the square root of the number of communicated covers. This result could be interpreted as the 
square root capacity law for a single image by dividing it into smaller blocks. The capacity result, however, was obtained 
with the assumption that the individual images (blocks) form a sequence of independent random variables, which is clearly 
false not only for images but also other digital media files. The main contribution reported in this section is to establish 
the same law for the simplest form of dependence that enables analytical reasoning -- it will be assumed that individual 
elements of the cover (e.g., pixels) follow stationary Markov chain. 


1.1. Basic assumptions. This section introduces notation and three basic assumptions under which the SRL (Square- 
Root Law) is proved. The first assumption concerns the impact of embedding. It is postulated that the stego object is 
obtained by applying a mutually independent embedding operation to each cover element. This type of embedding can 
be found in majority of practical embedding methods (see, e.g., [37] and the references therein). The second assumption 
concerns the model of covers. The individual cover elements are required to form a first-order Markov chain because 
this model is analytically tractable while allowing study of more realistic cover sources with memory. Finally, the third 
assumption essentially states that the steganographic method is not perfectly secure. 

Throughout the report, A = (a;;) denotes a matrix with elements a,;, calligraphic font (4’) to denote sets, and capital 
letters (X, Y) to denote random variables, both vector and scalar. If y is a vector with components y = (y),---, Yn), Ye 
denotes the subsequence y; = (yx,..-,y). If Y = (Y%4,.-., Yn) is a random vector with underlying probability distribution 
P, then P(Y; = yl.) denotes the marginal probability P(Ys = yx, Year = ψκ-.15.... Κ᾽ = yt). 

An n-element cover source will be represented using a random variable Xf = (X),...,X,) distributed according to 
some general distribution P( over 4", X 4 {1,...,N }. A specific cover object is a realization of Xf and will be 
denoted with the corresponding lower case letter zr? 2 (2,...,2n) € 4%". A stegosystem, with covers of fixed size 
τι, is a triple S, = (Xf, gn) Ψ(η)) consisting of the random variable describing the cover source, embedding mapping 
©("), and extraction mapping ¥™). The embedding mapping 6‘) applied to Xf induces another random variable 
Y" = (¥,...,Yn) with probability distribution Qe over X”. Specific realizations of Y;" are called stego objects and will 
be denoted y? Ξ (y1,..-, Yn). Here, 8 > 0 is a scalar parameter of embedding whose meaning will be explained shortly. 
The specific details of the embedding (and extraction) mappings are immaterial for this study. In particular, one only 
needs to postulate the probabilistic impact of embedding. 
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LSB embedding: +1 embedding: 


FIGURE 1.1. Examples of several embedding methods in the form of a functional matrix B. 


Assumption 1: [Mutually independent embedding] The embedding algorithm visits every cover element X; and, 
independently of all other elements, modifies it to a corresponding element of the stego object Y; with probability 


“πε; 1+fBej, ἰδὲπΞπΞ} 
(1.1) Qa(Ve = ἧχι =i) 2 b,5(8) = {1 TP BEI 
Beis otherwise, 
for some constants (c;,;) with οι.) 2 0 for i # j. Note that because }/,-y δὲ.) = 1, one must have δὲ, = — Σὺ κι ci, for 


each i € XY. The matrix (c;,;) reflects the inner workings of the embedding algorithm, while the parameter § captures 
the extent of embedding changes. It will be useful to think of 8 as the relative number of changes (change rate) or some 
function of the change rate. Also note that one can find sufficiently small 8, such that ὃ; .(β) > 0 for β € [0,89] and all 
ten. 

Because the matrix Bg = (b; ;(8)) does not depend on k € {1,...,n} or the history of embedding changes, one can say 
that the stego object is obtained from the cover by applying to each cover element a Mutually Independent embedding 
_ Operation (one speaks of MI embedding). The independence of embedding modifications implies that the conditional 
probability of stego object given the cover object can be factorized, i.e., ΟΡ ΙΧ ry) = ΠΕ, Q2(¥ilX:). 

Many embedding algorithms across different domains use MI embedding. Representative examples are LSB embedding, 
+1 embedding, stochastic modulation, Jsteg, MMx, and various versions of the F5 algorithm [29]. Examples of the matrix 
Bg for three selected embedding methods are shown in Figure 1.1. 

Next, an assumption on the cover source is formulated. 


Assumption 2: [Markov cover source] It is assumed that the cover source Xf is a first-order stationary Markov 
Chain over +, which will often be abbreviated as simply Markov Chain (MC). This source is completely described 
by its stochastic transition probability matrix A * (aj;) Ε RNY*%, ay; = Pr(X, = j|Xp-1 = i), and by the initial 
distribution Pr(X,). The probability distribution induced by the MC source generating n-element cover objects satisfies 
PO(XP = at) = P-Y( XPT! =a )a,,_,2,, where P\)(X)) is the initial distribution. Furthermore, it is assumed 
that the transition probability matrix of the cover source satisfies aj; > 6 > 0, for some 6 and thus the MC is irreducible. 
The stationary distribution of the MC source is a vector 7 = (7,...,7Nn) satisfying A = 7. In this report, it will 
always be assumed that the initial distribution P“)(X,) = a, which implies ΡΟ (ΧΑ) = a for every n and k. This 
assumption simplifies the analysis without loss of generality because the marginal probabilities P()(X,) converge to 7 
with exponential rate w.r.t. k (see Doob [17], equation (2.2) on page 173). In other words, MCs are “forgetting” their 
initial distribution with exponential rate. | 

Under the above assumption and the class of MI embedding, the source of stego objects no longer exhibits the Markov 
property and forms a Hidden Markov Chain (HMC) instead [79]. The HMC model is described by its hidden states (cover 
elements) and output transition probabilities (MI embedding). Hidden states are described by the cover MC and the 
output probability transition matrix B is taken from the definition of MI embedding. 

Unless stated otherwise, in the rest of this report gy denotes the probability measure induced by the HMC source 
embedded with parameter 6 into n-element MC cover objects. By the stationarity of the MC source, the marginal 
probabilities P™)(Xf+!) = P(X?) and ar ive) = Qs? (¥?) for all k. Sometimes the number of elements, n, will 
be omitted and P and Qs, will denote the probability distribution over cover and stego objects, respectively. 

The third assumption concerns the entire stegosystem S,,. Because it is known [61, 12] that steganographic capacity of 
perfectly secure stegosystems is linear in n, the SRL can only apply to imperfect stegosystems. 
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Assumption 3: [FI condition] It is assumed that the stegosystem S, = (Xf, ©") Ψ(π)) is not perfectly secure in 
the sense of Cachin [9] (the KL divergence Dx L(P™ ||Q%”) > 0). It is shown in Section (2), that for the special case of 
Markov cover sources Xj and MI embedding @(") this assumption can be equivalently stated in two different forms: 


(1) The pair (P), Qi?) does not satisfy so called Fisher Information condition, 


(1.2) vy εχ (P(X? =y2)>0) = (29 ΘΙ ΨὮ],.0 = 9): 
(2) There exists a pair of states (7,7) such that 
(1.3) P(X? = (i,9)) Καὶ Qa(¥7 = (i,j) for all B>0. 


For the sake of continuity, the PI only provides a few brief arguments. First of all, perfectly secure stegosystems must 
satisfy (1.2) because the Fisher information 
par, ᾿ 
᾿ 


appears as a coefficient in front οὗ β2 in the Taylor expansion of KL divergence Ὀκε(Ρῶ 9) w.r.t. 6 and thus 


1(0) = 


£Q5 (uy (y?)| p=0 must be zero whenever P(?)(y?) > 0. The opposite implication (zero Fisher information implies zero KL 


divergence) is not valid in general but holds for MI embedding as shown in Section (2). The second condition follows from 
the fact that second-order marginal statistics fully describe the first-order MC process and thus if (1.3) does not hold, 
then both cover and stego distributions are the same for all n (the stegosystem is perfectly secure). 

Finally, it should be stressed that Assumptions 1-3 are not overly restrictive and will likely be satisfied for all practical 
steganographic schemes in some appropriate representation of the cover. For example, in digital images it is unlikely that 
the distribution of each pixel depends only on its neighbor, but the dependency is likely to be spatially-limited. Then the 
image can be modeled as a Markov chain made up of overlapping pixel groups. Furthermore, if a stegosystem preserves 
the first-order statistics of a cover source, it is likely to be detectable by considering higher-order dependencies: the 
apparently-perfect stegosystem becomes imperfect when the cover is represented by pairs or groups of pixels, coefficients, 
or some other derived quantities. 


1.2. The square root law of steganographic capacity. This section contains the formulation and the proof of the 
main result, which states that the steganographic capacity of imperfect stegosystems with Markov covers and mutually 
independent embedding operation only grows with the square root of the number of cover elements. This finding has some 
fundamental implications in steganography and steganalysis. Probably the most remarkable one is that steganographic 
capacity exhibits quite different properties when compared with capacity of noisy channels or lossless compression. For 
example, while a mismatch in source model decreases the compression gain by a constant (the KL divergence between the 
source model and true source distribution), a cover model mismatch in steganography leads to vanishing capacity. 

The steganographer is at risk (w.r.t. some fixed tuple (Pi,, Piyp), with 0 < Pz, < 1 and 0 < Pi,, < 1— Pz,) if 
the warden has a detector with probability of false alarms and missed detection ἤρα, ΒΜ satisfying Pra < Pp, and 
Pup < Pyyp. 

Theorem 1: [The square root law of steganography for Markov covers] For a sequence of stegosystems (S,)°, 
satisfying Assumptions 1-3, the following holds: 


(1) If the sequence of embedding parameters B(n) increases faster than 1/,/n in the sense that limin—+oo & ’ 
then, for sufficiently large τι, the Steganographer is at risk for arbitrary tuple (ῬΈΑ. ΡΝ). 

(2) If B(n) increases slower than 1//n, limp-+0 ΠῚ = 0, then the stegosystem can be made €-secure for any € > 0 
for sufficiently large τι. This implies that the Steganographer is not at risk, for any tuple (Pz,, Pi,p)- 

(3) Finally, if B(n) grows asymptotically as fast as 1//n, limp oh = € for some 0 < € < οὐ, then the stegosystem 
is asymptotically Ce?-secure for some constant C. 


ja = 0; 


Proof: Each part of the theorem is proved separately. From the Kerckhoffs’ principle, the warden knows the distribution 
of cover objects P™ = Q’”. 

Part 1 [Steganographer at risk] Here, one needs to prove that the Steganographer is at risk w.r.t. any (P?,, Pf,p) for 
all sufficiently large n. This means that a sequence of detectors, D,,, needs to be constructed for the following composite 


FINAL REPORT FOR CONTRACT FA9550-08-1-0084 ENTITLED “TOWARDS STATISTICALLY UNDETECTABLE STEGANOGRAPHY” 8 


binary hypothesis testing problem 
Ho : β == ἢ 
Hy : β»0 


based on observing one stego object (one realization of a random sequence with distribution ay”). The error probabilities 


of these detectors are required to satisfy ρα < Pp, and Pyp < Piyp for all sufficiently large n. The test statistic for 
each detector D, is described next. 


Equation (1.3) in Assumption 3 guarantees the existence of pair of states (i,j) such that P(X? = (i,7)) # Qa(Y? = 
(i,7)) for all 8 > 0. Thus, the test statistic vg, for detector D, is defined as 


(1.4) van = Va|—— heli, j] = P(X? = .2}}} 
where —+hgli, j] is the relative count of the number of consecutive pairs (i,j) in an n-element stego object embedded 
using parameter § (In terms of indicator functions', ἢ [1,2] = 7f_; Ir y,=i,¥,,;=3}). Note that due to stationarity of the 
cover source, E | trheli al] = Q,(Y? = (i, 7)) for all β. 

The following is proved for the difference between the means of vg, under both hypotheses: 
(1:8) lim E\vg.n] -- E[vo,n] = 00 when V/nB — oo. 


Suppose, for a contradiction, that there exists K > 0, and a strictly increasing sequence of integers (n,,)°°_, for which 
(1.6) [E(vg nn) — E[von,|| « K for all m. 


If lim sup,,-,.9 8(Mm) = Bo > 0, then there exists a subsequence of (n»,)°°_,, which will be denoted the same to keep 
the notation simple, such that lim,,... 8(mm) = βο. For this subsequence, however, the difference 


Elvp.2m)—Elonm] = Vim|Qa(¥? = (5) -- P(X? =(4,3))| 
tends to co with m — oo because by (1.3) the absolute value converges to a positive value independent of m. This is, 
however, a contradiction with (1.6). 
If limm—oo 8(nm) = 0, one find the contradiction in a different manner. By the FI condition from Assumption 3, there 
must exist states (7,7) such that 45Qp=0(Y? = (i,j)) #0. From the Taylor expansion? of Q3(Y? = (i,j)) at 8 = 0 with 
Lagrange remainder and 0 < € < 1 


(1.7) Εἶν»... - Elona) = νῆπιβ teQano(¥? = (id) + 38-F5Qea(¥? = 0.2} 


which tends to oo as m -- οὐ when \/n7,8 — 00, which is again a contradiction with (1.6). In summary, E[vg.,|— E(Von] > 
oo holds for any sequence §(n) for which /n8(n) -- oo. 
Lemma 1 proved below shows that exponential forgetting of Markov chains guarantees that 
(1.8) Var|vgn| «Ὁ 
for some constant Οἱ independent of n and 8. Equations (1.5) and (1.8) are all that is needed to construct detectors Dp, 
that will put the Steganographer at risk for all sufficiently large n. The detector D, has the following form 
Van >T decide stego (8 > 0) 
Ygn <I decide cover (β = 0), 


where T is a fixed threshold. It is now shown that T can be chosen to make the detector probability of false alarms and 
missed detections satisfy 


Pra < FP A 

Pup < Pup 
for sufficiently large n. The threshold T(P,) will be determined from the requirement that the probability of the right 
tail, zs > T(Pp,), under Ho is at most Pz,. Using Chebyshev’s inequality, 


Var|v0,n| κι Cc 


Pra = Pr(von 2T)<Pr(\von|2=T) < ΤΩ 72° 


1For any two statements A, B, I, 4,8} = 1 if A and B are true, otherwise 1,4 5) = 0. 
2The Taylor expansion is valid since by its form the function Qa(Y st? = (i, 7)) is analytic. 
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Setting T = /C/Ph, gives Pra < Phy. 
Because of the growing difference between the means (1.5), one can find n large enough so that the probability of the 


left tail, zs < T(Ps,), under H, is less than or equal to Py,p. Again, one can use the ae s inequality with the 
bound on the variance of vg, to prove this: 


Pup = Pr(vgn<T(Pra )) = Pr(vgin — Elvgn — Von| < T(Pp,) — Elven — Von) 
C 
(Elvan — von] -- T(Pi,))” 
which can be made arbitrarily small for sufficiently large n because E[vg,,| — Εἶνο, αἰ + oo. This establishes the first part 
of the square root law. 


Part 2 [Asymptotic undetectability] Now it is proved that when ./nf§ — 0, then the KL divergence between the 
distributions of cover and stego objects tends to zero, 


n 
(1.9) 4,(6) = Der (P™IQD) = S> PM ΧΡ =vF)Ig a τς 
ν᾽ τ Qs (Y; 1) 
which will establish that the steganography is €-secure for any € > 0 for sufficiently large n. By the well-known connection 
between hypothesis testing and KL divergence [9], no nontrivial upper bound on false alarms and missed detections will 
be met, for large enough n. 

Using Taylor expansion of d,,(8) with Lagrange remainder at 8 = 0, dn(8) = d,(0)+d),(0)8 + S108) g2, whereO <u <1. 
This step is valid since under the above assumptions all derivatives of (normalized) KL που ν are continuous w.r.t. 6 
(the complete proof appears in this technical report {18]). The term d,(0) is zero because both distributions are the same 
when β =0. The term αἱ (0) is also zero because 


0 log 2 ΕΣ τς PO = vi) 


- Men - ὦ (1) τσπ ,,7χγλ} -- 
᾿ BB, joga ap (2 9 δ =sf)) = 


Ξ Pr(|vg,n — Elven -- Vo.n]| > Elvg.n — Vo,n] — T(PFa)) < 


LQ (yp =y7) 


d’ (0 τ 
Ὁ Oe (YP = 97) 


lim d,,(8) = lim = pe rent = yf) 


=} 

Finally, by Lemma 2 in Appendix there exists a constant C, such that +d} (B) <C for Be [(0, Bo] and all n. Thus, 
dn(B) < 4Cnp? + 0 when /nB -- 0. 

Part 3 [Asymptotic Ce?-security] To prove the third part of the square root law, one again expands the KL divergence 
d,(8) at 8 = 0 up to the third order with the Lagrange form of the remainder 


(1.10) dy(B) = ὦ (SO) ng? + 5 (Ξε 96}: 


for some 0 < uv < 1. According to [18], both normalized ΕΝ of the KL divergence, απ (0) and Σαμ(υβ), are 
upper bounded by the same finite constant C for all 8 € (0, Bo]. Since 8(n)./n > ε with πὶ — 00, B(n) > 0 and thus 
the expansion is valid. By the same reason, the second term in (1.10) converges to zero as n —> oo. From this result, one 
obtains the asymptotic bound on KL divergence in the form d,(8) < 4C¢? as was to be shown. 


1.3. Discussion. A genera] theme is now emerging in steganography literature: whether steganography is performed in 
a large batch of cover objects or a single large object, there is a wide range of situations in which secure capacity grows 
according to the square root of the cover size. Such results will likely hold for all stegosystems that are not perfectly 
secure in the sense of Cachin. It appears that the theory of hidden information is quite unlike the traditional theory of 
information. 

The result presented here is the first to allow dependence between the components of the cover. The square root law of 
steganographic capacity was proved for single covers under essentially three conditions: that they can be represented as a 
Markov chain, that the embedding operation can be modeled as independent substitutions of one state for another, and 
that the embedding scheme does not preserve all statistical properties of the cover. This applies to a very wide range of 
popular steganographic algorithms, in spatial and transform domains. The last condition is important because it is known 
that perfectly secure steganography, conveying information at a linear rate, can always be constructed if the cover source 
is perfectly understood [89]. However, it can be argued that digital media cover sources will never be perfectly understood. 
To explain why it makes sense to assume that the warden has perfect knowledge of the covers (necessary for construction 
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of the detectors in Part 1 of the proof), one needs to look at the cautious nature of Kerckhoffs’ principle: although the 
steganographer may believe that the warden does not know the complete cover distribution, there is always the risk that 
the warden knows more than the steganographer does. For example, she might know more about dependencies between 
cover elements. Since the steganographer cannot say for certain how much the warden will know, they must assume the 
worse case: complete knowledge. 

The formulation of the theorem parallels that of [1]: embedding at ἃ rate faster than \/n leads to eventual detection, - 
whereas embedding at a rate slower than ,/n leads to eventual e-security. At rates A,/n, the stegosystem is e-secure. This 
does not refer to embedding at a diminishing rate in a single cover object (which would be a different theorem, and is an 
avenue for future research). The quantity $(n) describes a constant embedding rate throughout an object of size n: one 
could think of the function 6 as giving a strategy describing how much data can be hidden in an object of each size. The 
SRL says that over-ambitious strategies lead to easier detection in larger objects, cautious strategies lead to more difficult 
detection in larger objects, and quantifies the boundary. 

The square root law has some important implications in steganography and steganalysis. Most significantly, the simple 
fact that capacity is sublinear means that a true steganographic channel, with positive rate, cannot be constructed unless 
the cover source is known perfectly. For steganalysis, the SRL explains in part why the same relative payload can be 
detected more accurately in large images (however it is not the whole explanation: larger images tend to have more smooth 
gradients and less noise, and these factors also influence detection accuracy). Thus, when benchmarking steganography, 
the distribution of image sizes in the database influences the reliability of steganalysis and makes it more difficult to 
compare the results on two different databases. To resolve this issue, one might switch to measuring the payload in bits 
per square root of pixel, an idea further explained in Section 2.3. 

Finally, it should be emphasized that the square root law of capacity relates to the number of changes caused by the 
embedding process, and not to the size of the information transmitted. With adaptive source coding methods (even simple 
matrix embedding based on Hamming codes will do) the number of bits of information which can be conveyed, by making 
at most c changes in n cover locations, is O(clog(n/c)). Therefore the SRL implies an asymptotic information capacity 
which is O(,/nlog n) (i.e., still sublinear), in the absence of perfect steganography. 


1.4. Proofs. In this appendix, the PI includes two auxiliary lemmas needed in the proof of the SRL in Section 1.2. 
Lemma 1: Let vg, be the random variable defined in (1.4) for a fixed value of the parameter 6 and cover size n. The 
variance of this random variable can be bounded by a constant C for every value of 8 and n 


4C,VB,Wn Var([vg.n)] < C. 


Proof: From the definition of vgn 


4 2 n=1 2 τι--ὶ 2 8:2 
AN Varivpn] = Ε[( 2 Lystieayy) | - ΕἸ lisa] SD, Varllygr acy] + 


(1.11) +2/ τ Elly +1474 ΠΝ ΕἸ γε Εν μαι, ce] + 2n. 
k+1<k 


In the last sum, the PI bounded all terms for k = k —1 by 1 and thus obtained the term 2n in the last inequality. For 
any event A, 


Var|I4] = Pr(A) -- Pr(A)? = 4-(Pr(A)- 3)’ <3 


so the first term is bounded by 25+. 
Finally, one finds an upper bound on the sum in (1.11) in the form of C2n for some positive constant C2. This will 
2 
give us the proof because Var|vgn| < mene ((n — 1ὴΣ + 2Cgn + 2n) < 4(4 + 2C2 + 2), and Tmoir <4 for n > 2. Thus 
C = 8C2 + 9. 
The PI starts by showing that 


Qa(¥i*! = ((,2), YF"! = (69) — On(VE" = (2) Qe(¥e"" = 3) 
(1.12) = [avi = DIVE = 9) - Qa(¥E"" = GA) [OnE = 4.9) < N20, 
Cel : 


<N2pk-k-2 
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for some 0 < p < 1 and k+1 < k (N is the number of all possible states of the MC). In other words, the HMC is 


exponentially forgetting its initial condition. Then, one will be able to ai the sum in (1.11) by N? Σ ἢ... 1-2 pk-k-2 — 
ΝΣ Σ Te Ὁ ἐπ: « N? ess = Νῆ(π-- 2) τς s a Thus, C2 = ΕΞ, because Qa(Y,'*) = (i,j) <1. 


The term Qa( ae = (i, me in (1.12) can be written as 


(1.13 Qa (¥E*? = (4,3) = Σ On (VA = AXE = ᾧ 2) P(XEY =O) = DY 0,0, P(XE = 79). 
(3,3) (4,3) 


The term Ὁ 8 sa = (i, j)|Yf*? = (i,7)) in (1.12) can be written as 


Qs(¥e" = ime = (i,7)) 
Qs (¥** = (3,3) 
je a4 εἴὰ u- ΕΣ 
i) La, 3) 93,03 jj ὃ; 765 gP(X, +h = (1,9), X,~ a (?, 3)) 
Qa(¥%,"? _ (i,j) 
Finally, P(X?* = (7), Xp" = (i,j)) can be factorized as a = (1, 7)|Xgt? = (1,3) P(X_*! = (4,3)). Due to 
the Markov property of the random variable XAT, ΡΙΧΕ" --᾿ (ὦ ΧΡῊ = (ὦ. }))} = P(xitl = (?,9)|Xk41 = J). For each 
pair of indices (1,9), the PI defines the index jmaz = arg max; P(X; κει ς (1,9) |Xnoi = j). Then 


Di 3) δὶ t by .P(xer! _ (2, Bi They -_ Σύ. LG 3) δ᾽ Ὁ) gP(Ky™ = (i, 9)) 
Qe(Y,*? = (3,9) 

5 b: b; .P(XE*) (ὦ, 2} Χκαι T Fellas }o 

(3,9) 

Now, one can combine (1.13) and (1.14) to prove (1.12) as 


Qa(¥i*? = (ὦ, 2] δὶ = (6.5)) - Oa(¥E = (4,9) 


(1.13),(1.14) ᾿ ἡ Ξ ’ ms 
Σ΄ δὴν  (P(XE = ᾧ Χο = maz) — P(XP*? = (..2}}} 
(1.2) 
(1.15) = Σ ὑπό ΡίΧεμ = χε τ Ὦ Ρ(Χι = ἔχει = maz) - P(Xy τ- ἢ) < Νῆρ᾽ 5:5 
(3,3) 


It is a well known result in MCs that the absolute value of the term P(X, = 1|X441 = Jmaz) — P(X, =7) in (1.15) can 
k-k-2 


Qe(Ver? = ὦ] YE" = 43) 


= (#). 


(#) < 


(1.14) =" 


be bounded by p (exponential forgetting), for some constant 0 < p < 1. This is because the MC is irreducible due 
to the assumption ας, > ὃ (see equation (2.2) on page 173 in Doob [17]). This bound does not depend on jmaz. The final 
bound does not depend on § because δι. 51 and δ᾽; ἃ | 


Lemma 2: Let ἀ,(β) = Dx 1(P™1|Q9) be the KL divergence between n-element cover and stego sources as defined 
in (1.9). Then, 


“fo, IC > 0,¥B € [0,fo],¥n —"(A) « δ. 


In other words, the second derivative of the normalized KL divergence, 4d" (B), can be bounded by a constant C for each 
τι and β. And this bound does not depend on n or 8. 


Proof: The problem of bounding normalized derivatives of KL divergence for the case of HMC was studied by Mevel et 
al. [60]. Their results, namely Theorem 4.4 and Theorem 4.7, however, cannot be directly applied to our case because our 
assumptions are different. In particular, Assumption C on page 1124 is not satisfied because zeros are allowed in matrix 
B. Motivated by this work, the PI derives a more general result about the normalized KL divergence and its derivatives 
(see the report in [18]. Intuitively, one can expect the normalized KL divergence to be arbitrarily smooth and bounded 
due to the smooth transition from P to Ὁ and the fact that d,(0) = 0. The main result of the report, formally stated 
in [18], Theorem 3, says that every derivative of +d,(f) w.r.t. 8 (and the function 4d,(8) itself) is uniformly bounded 
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and Lipschitz-continuous (or simply continuous) on |0, βο]. These properties are independent of n > 1. From this fact, 
Lemma 2 can be obtained as a special case. This result also allows us to expand the KL divergence into a Taylor series 
with respect to β. 
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2. COMPLETE CHARACTERIZATION OF PERFECTLY SECURE STEGO-SYSTEMS WITH MUTUALLY INDEPENDENT 
EMBEDDING OPERATION 


In steganography, the sender and receiver communicate by hiding their messages in generally trusted media, such as 
digital images, so that one cannot distinguish between the original (cover) objects and the objects carrying the message 
(stego objects). Formally, the security of a stego-system is evaluated using the Kullback-Leibler divergence between the 
distributions of cover and stego objects [9]. Systems with zero KL divergence are called perfectly secure. 

Formally, a stego-system is a combination of an embedding algorithm and a cover source. The vast majority of practical 
stego-systems hide messages by modifying individual cover elements using mutually independent embedding operations, 
e.g., LSB and +1 embedding, F5 algorithm, perturbed quantization, MMx, stochastic modulation, and many others 
(see [37] and the references therein). 

This section of the report provides a complete characterization of perfectly secure stego-systems for the class of embed- 
ding algorithms that employ mutually independent (MI) embedding operations, which is the class of operations for which 
the SRL was proved in the previous section. The cover distributions of all perfectly secure systems form a linear vector 
space spanned by distributions determined by the embedding operation. Moreover, perfect security (zero KL divergence) 
is equivalent to satisfying a simple condition related to Fisher information. This result suggests that Fisher information 
can be used as an equivalent descriptor of steganographic security. 


Definition 1. Steganography is perfectly secure iff 


4(8) # Dex (P\IQ@s)= Σ᾽ Plvt)log = Stes τὸ 


yrpEexr 


or €-secure if α(β) < ε. 


For better flow, the PI reminds that the impact of embedding with parameter 6 € [0,60] on the k-th element can be 
captured using the matrix b;,;(8) = Pr(Yx = j|Xx = ἡ) = δι.) +Bci,;, for some constants ¢;,; > 0 fort γέ 7, cs = — 54,3, 
where δι.) is the Kronecker delta. In a matrix form, Bg = 1+ SC, where Bg = (b;,;(8)), 1 is the identity matrix, and 
C 4 (c,,;). It is also assumed that embedding operations are mutually independent, ΡΥ ΧΡ) = []z_, Pr(Vs|Xx). By 
the definition of 6; ;, the matrix Bg is stochastic, δὰ δι.) = 1. Finally, it is assumed that 0; ;(8) > 0 for all 8 € (0, Bo). 
The matrix Bg represents an embedding algorithm with MI embedding operation (simply MI embedding). Examples of 
practical embedding methods that fall under this framework are shown in Figure 2.1. 

To simplify the language in this report, one will speak of security of a cover source w.r.t. a given MI embedding meaning 
that the cover source is perfectly secure w.r.t. B, if the resulting stego-system is perfectly secure. It does then make sense 
to inquire about all possible perfectly secure cover sources w.r.t. MI embedding with matrix Bg. 


2.1. A few known results from ergodic theory. Some results from the theory of ergodic classes are now reviewed [17]. 
They will be later applied to the stochastic matrix Bg. For states i, 7 € Δ΄, 7 is called a consequent of 7 (of order k) (i + 7) 
iff 3k, (BS): #0. State i Ε ¥ is transient if it has a consequent of which it is not itself a consequent, i.e., 37 € Δ΄ such 
that (i > 7) > (7 #1). Furthermore, i € Δ΄ is non-transient if it is a consequent of every one of its consequents, Vj € Δ΄," 
(i + 2) = (7) > 1). The set X can be decomposed as Δ΄ = FU &, U---UE,, where F is the set of all transient states 
and €4, a € {1,...,k}, are so called ergodic classes. Two non-transient states are put into one ergodic class if they are 
consequents of each other. 

Let matrix Bg have k ergodic classes. Then, there exist k linearly independent left eigenvectors, denoted as 1)... , #(*), 


of matrix Bg corresponding to eigenvalue 1, called invariant distributions. If r*)Bg = 2), for some a € {1,...,k}, then 
πίο) > 0 for allt € &, and ni) = 0 otherwise. Every other 7 satisfying Bg = 7 is obtained by a convex linear 
combination of {7')|a € {1,...,k}}. For a complete reference, see [17, Chapter V, 82]. The set of ergodic classes for 


matrix Bg depends only on the set {(i, j)|b;,;(8) τέ 0). Since 6; ;(8) = 0 iff c;,; = 0 for ὁ 4 7 and b;;(8) > 0 for B E€ (0, Bo}, 
the structure of ergodic classes does not depend on 8. Moreover, if 7Bg = 7 for some 8 > 0, then πῷ = 0 and thus all 
invariant distributions are independent of 8, because Bg: = πὶ + β'πῷ = πὶ = 7. By this reason, the index β will be 
frequently omitted. 


2.2. Perfectly secure cover sources under mutually independent embedding operation. The matrix B repre- 
sents an arbitrary MI embedding with k ergodic classes ξὰ and invariant distributions 7), a € {1,..., k}. The following 
example describes a construction of perfectly secure cover sources w.r.t. B. 
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LSB embedding: Η = 1-8 ΗΙ -- 8 


| e@ Ὁ» 
aan 


transient state QO non-transient state 


FIGURE 2.1. Examples of several embedding methods and their ergodic classes. 


Example 2. [Perfectly secure cover sources] Let P) be a probability distribution on 2-element cover objects defined as 
P?)(X? = (i,9)) = mn for some a,b € {1,...,k}. Then P®) is a perfectly secure cover source w.r.t. B because 


&@B(2)(¥12 = (,3)) = (Sibi, sP(X1 = ἢ) ($7903, ΡΧ2 = 3) 
& =(2(a)B)i(m(b)B)j = wi(a)nj(b) = P(2)(X12 = (i, j)), 


and thus both distributions P), and gy’ are identical, which implies perfect security. Since this construction does 
not depend on the particular choice of a,b € {1,...,k}, one can create k? perfectly secure cover sources w.r.t. B. The 
probability distributions P‘) obtained from this construction are linearly independent and form a k?-dimensional linear 
vector space. By a similar construction, one can construct k” n-element linearly independent perfectly secure cover sources 
w.r.t. B. 


It is shown next that there are no other linearly independent perfectly secure cover sources w.r.t. B. 


Theorem 3. [Mutually independent embedding] There are exactly k" linearly independent perfectly secure probability 
distributions P on n-element covers. Every perfectly secure probability distribution P w.r.t. B can be obtained by a convex 
linear combination of k" linearly independent perfectly secure distributions described in Example 2. 


Proof. It is sufficient to prove that there cannot be more than k” linearly independent perfectly secure probability distri- 
butions P on n-element covers. The proof is shown for n = 2 and then its generalization is discussed. 

Define the following matrices P = (p;,;), pi,j = P(X? = (i,7)), and Ὁ = (σι), ai,3. = Qs(Y? = (1,9)). By defininition 
of MI embedding, 


= >> Qel(¥? = (i,7)|X? = (v,w)) P(X? = (υ,ω)) 
(v,wjEex? 


= δ᾽ δυιδω)ρυω . 


v,wEerX 
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Define matrix D £ (d.,2 v2) of size N* x N?, where dy?2.y? = buy ,v; 5u2,v2- If p is defined as one big row vector of elements 
pi,j and similarly g, then assuming perfect security of cover source w.r.t. B (P = Q), one obtains g= pD = p and thus p 
is a left eigenvector of D corresponding to 1. Matrix D is stochastic and thus it is sufficient to show that it has k? ergodic 
classes. 

This is achieved by showing first that 


(2.1) us yo υΐ & (uy se v,) and (ug ἂν v2), us, υΐῖ ε X?. 
By u? my v? it is meant that v? is a consequent of u? of order m in terms of matrix D. If u? ἂν; v?, then there exist 
m — 1 intermediate states ;w?,...,m—1 w?, such that αἱ, ἀν ὦ, Ἄχ “αἰ χω, > 0. Since dy2\y2 = bu,,v,bus,v2, this implies 


the existence of both paths u, (my) v; of order πὶ, i = 1,2. The converse is true by the same reason. 


( ( 


It is now shown that € x &, a,b € {1,...,k} are the only ergodic classes. If ὦ] my) v,; and u2 my) v2, then 


u? eat α) v? for all uj,v; € € and ug, v2 Ε Ep, because the path from ὧς to vu; can be arbitrarily extended by adding 
self loops of type j —> j since all diagonal terms b,,; are positive and thus by (2.1) one obtains u? ee v?. Finally by 
μοῦ € Eg and ug, v2 € Ep, vi — ὡς and by the same argument ve - ur, and therefore E€, x € are ergodic classes. Any 
other state uz €E,x FUFxE,UF x F must be transient w.r.t. D, otherwise by (2.1) a contradiction with ὡς € F for 
some ἃ is obtained. 

This proof can be generalized for n > 3 by proper definition of matrices "ἢ, Ὁ, and D. In general, matrix D has 
size N" x N™. By similar construction one obtains k” ergodic classes of generalized matrix D. However, k” linearly 
independent distributions are already known. 0 


2.3. Perfect security and Fisher information. Here, it is shown that for stego-systems with MI embedding perfect 
security can be captured using Fisher information. From Taylor expansion of KL divergence, for small 8, d(8) = 5 B71(0)+ 
O(8*) where I(0) = 07d(8)/08?|,~0 is the Fisher information w.r.t. 8. If for some stego-system d(f) = 0 for β € (0, Bo), 
then J(0) = 0 from the Taylor expansion. Even though the opposite does not hold in general, the PI will prove that for MI 
embedding zero Fisher information implies perfect security. In other words, a stego-system with MI embedding is perfectly 
secure for 8 Ε (0, βο] if and only if (0) = 0. This supplies a simpler condition for verifying perfect security than the KL 
divergence. Fisher information also provides a connection to quantitative steganalysis because 1/J(f) is the lower bound 
on variance of unbiased estimators of 8. Moreover, J(0) could be used for comparing (benchmarking) stego-systems. 
First, he condition (0) = 0 is reformulated: 


Proposition 4. Let P and Qs, be probability distributions of cover and stego objects with n elements embedded with 
parameter 8. The Fisher information is zero if and only if the Fl-condition is satisfied 


(2.2) wr Ex” (P(XP =P) > 0) > (F.Qa(v)|pu0 = 9): 
Proof. The second derivative of α(β) at 8, d’(G), can be written as 
ny ( ὠβίν) ply?) ‘) 
2. I =— P "ον τ ign) 
ἘΝ (5: ὥς ae Gr (onan) | 


where Q2(yt) = ἕο (νυ). By P(y}) = Qs=o(y{), the first term in the bracket in (2.3) sums to zero at β = 0, and thus 
I(0) is zero iff Q’(y?)| βωο = 0 is zero for all y € Y” for which P')(y™) > 0 as was to be proved. Here, it is assumed 
that the KL divergence d(8) be continuous w.r.t. 8, which is valid by the construction of the matrix B. O 


The next theorem shows that the FI condition (2.2) is equivalent with perfect security for MI embedding. 


Theorem 5. {Fisher information condition] There are exactly k" linearly independent probability distributions P on n- 
element covers satisfying the FI condition (2.2). These distributions are perfectly secure w.r.t. B. Every other probability 
distribution P satisfying (2.2) can be obtained by a convex linear combination of k” linearly independent perfectly secure 
distributions. 


Proof. From Example 2, k” linearly independent perfectly secure distributions are available. By Taylor expansion of d(), 
these distributions satisfy the FI condition, because d(S) = 0 > I(0) = 0. It is sufficient to show that there cannot be 
more linearly independent distributions satisfying the FI condition. 
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Similarly as in the previous proof, the theorem is reformulated as an eigenvector problem so that one can use ergodic 
class theory to give the exact number of left eigenvectors corresponding to 1. Again, the proof is first carried out for the 
case,n = 2. | 

If P satisfies (2.2), then the linear term in the Taylor expansion of Qs(y?) w.r.t. 8 is zero. By the independence 
property, (Q(yf|z?) = [];_; Q(yilzi)), and the form of matrix αὶ (Bg = 1+ SC), condition (2.2) has the following form 


dQ 3(y?) ᾿ = oF 
ΦΎΣΙΝ, = lim P(zt) = 1] Qa (mlz) 
dB \s=0 8-0 ΡΣ Ὰ 1 4β I] 
(2.4) = δ cay P(ti,y2) + δ΄ Ceay2P(yi, 22) = 0. 
ry EX T2EX 


Define matrix P = (p;;) as pi; = P(X? = (i,j)) and represent it as a row vector p. If one defines matrix Ὁ 4 (9u2,n2) of 
size N? x N? as 


Cu,v, if uy #v, and ug = ve 


(2.5) dy2y? ΞΞ ὁ Cugv. ifuy =u and 2 τῇ v2 
0 otherwise, 
and a diagonal matrix G = (9u2,v2) of size N? x.N? as Ju2,u2 = —Cur,ui — Cuz,u2, then equation (2.4) can be written in 


a compact form as pD = pG. Both matrices D and G are non-negative by their definitions. Let H = I + 7(D—G). 
Using y = (max,2¢72 9u2.u2) πὰ the matrix H is stochastic and pH = p iff pD = pG. Thus, (2.2) is equivalent with an 
eigenvalue problem for matrix H. 

First, observe that for 1 # 7 cj; > 0 iff hiia),(j,a) > 0 for all a € Δ΄, because by (2.5) hiiay.(5,.a) = γά(ι,α),(),.) = Yeu (the 
first case when ug = v2). Similarly, for i # 7 ci; > 0 iff hyai),(a,3) > 0 for all a € X (the second case when u; = v,). This 
means that i -- 7 iff (?,a@) - (j,a) w.r.t. H for all a € Δ΄ and similarly i > 7 iff (a,1) + (a,j) w.r.t. H for alla € ζ΄. 
This can be proved by using the previous statement. By this rule used for a given u? € & x &, one obtains u? > v? 
and v? -- u? for all v? € € x & and thus €, x &; is an ergodic class w.r.t. H. Te PI now shows that there can not be 
more ergodic classes and thus all k? of them are found. If εὖ € F x €, then αὖ has to be transient w.r.t. H, otherwise 
one would obtain contradiction with Δ € F. This is because the only consequents of order 1 are of type (i,a) + (j,a) 
or (a,i) — (a,j), therefore if αὖ € F x €, one chooses v? Ε X x E, such that v; A u, (u, is transient and thus such υἹ 
must exist). State u? must be transient. otherwise u? <+ v? implies u; +> v1, which results in contradiction with vy A uy). 
Similarly for u? € € x FUF x F. This proof can be generalized for n > 3 by assuming larger matrices P, D, G, and H, 
obtaining exactly k” linearly independent perfectly secure distributions satisfying the FI condition. C 


Next, the PI discusses the structure of the set of invariant distributions for a given MI embedding and shows how to find 
ergodic classes from matrix B in practice. By Theorem 2.1 from [17, Chapter V, page 175], this can be done by inspecting 
the matrix limit M = (mj,;) = limp. ὦ Σ᾿... Β΄. According to this theorem, state i is non-transient iff m;,; > 0 and is 
transient otherwise. Two non-transient states 7,7 € X are included in one ergodic class if m;,; > 0. All rows of the matrix 
M corresponding to states in one ergodic class €, are the same and equal to the invariant distribution of this class, 1°). 

This section is closed with a short discussion of two practical embedding algorithms. For the F5 embedding algo- 
rithm [90], the set of states 7 = {—1024,...,1024}. By the nature of the embedding changes (flip towards 0), there is 
only one ergodic set €; = {0} and F = X \ {0}. Thus, there is only one invariant distribution, πὸ = 1 and zero otherwise. 
Obviously, no message can be embedded in covers with this singular distribution. 

For the case of LSB embedding over 1 = {0,...,255}, one has €, = {2a,2a + 1} for a € {0,...,127}, F = 0 and 
nia) = κα 1 = 5 and zero otherwise (LSB embedding cannot be detected in images with evened out histogram bins). 
Thus, sources realized as a sequence of mutually independent random variables with such a distribution are the only. 
perfectly secure sources w.r.t. LSB embedding. Figure 2.1 shows examples of matrices B and ergodic classes of several 
known algorithms with MI embedding operation. 


2.4. Application to Markov cover sources. In this section, the results obtained so far are reformulated for a special 
type of cover sources that can be modeled as first-order stationary Markov Chains (MC). The results play a key role in 
proving the square root law of steganographic capacity of imperfect stego-systems for Markov covers [27, 54]. 

First, for stationary cover sources Theorem 3 leads to this immediate corollary. 


Corollary 6. There are exactly k (instead of Κ᾽} linearly indpendent perfectly secure stationary cover sources. These 
sources are 1.t.d. with some invariant distribution πα, α € 1,...,k. 
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The next corollary states that in order to study perfect security of n-element stationary MC covers, it is enough to 
study only 2-element covers. 


Corollary 7. Let P, Qs be first-order. stationary MC cover distribution and its corresponding stego distribution after 
MI embedding with parameter 8. For a given n > 2, an n-element stego-system is perfectly secure iff the corresponding 
stego-system narrowed to 2-element cover source is perfectly secure for some Bo > 0: 


(2.6) 360» 0, νυν eX? P(X? = yf) = OG) (XT = v7). 
Moreover, the FI condition for Markov sources simplifies to 

d 
(2.7) Vy? ε x? (P(x? - 99)» 0) ap (508 οἷν sao = 0). 


Proof. Because invariant distributions do not depend on 8, Equation (2.6) must be valid for all 8 > 0 once it holds for 
some βὺ (see the arguments at the end of Sec. 2.1). By Corollary 6, if the stego-system is perfectly secure (n > 2), then 
the cover source is i.i.d. with some invariant distribution w.r.t. MI embedding and thus (2.6) and (2.7) hold. On the other 
hand, if (2.6) and (2.7) hold for n = 2 and stationary cover source, then this cover source is i.i.d. with one of k invariant 
distributions. This completes the proof since 2-element marginal is sufficient statistics for a first-order stationary MC. 0 


2.5. Discussion. Most practical stego-systems for digital media embed messages by making independent changes to 
individual cover elements. If the embedding operation is fixed, one may inquire in which cover sources the embedding 
is statistically undetectable in Cachin’s sense. The main contribution of this part of the report is a complete geometric 
characterization of such sources. Using the theory of ergodic classes, it was shown that all cover sources that are perfectly 
secure with respect to mutually independent embedding form a vector space spanned by invariant distributions determined 
by the embedding operation. 

Additionally, it was shown that perfect security of stegosystems with mutually independent embedding is completely 
captured using Fisher information formulated in Section 2.3 as the FI condition. This result not only provides a simpler and 
equivalent condition for perfect security, but it finds further applications in steganalysis. For example, Fisher information 
could be used for benchmarking such stego-systems, a direction pursued in Section 2.3. Moreover, Fisher information 
provides fundamental lower bounds on the variance of unbiased estimators of the change rate, which connects our results 
to problems in quantitative steganalysis. Finally, the FI.condition plays a key role in proving the square root law of 
steganographic capacity of imperfect stego-systems [27, 54] (Section 1.2). 
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3. EXPERIMENTAL VERIFICATION OF THE SQUARE ROOT LAW 


The PI conducted extensive experiments to validate the square root law described in the two sections above. Payloads 
will be embedded, using a number of different embedding methods and various payload lengths, into cover images of 
different sizes. Then state-of-the-art steganalysis methods will be applied to these images while looking for a square root 
relationship. The investigation is first directed to spatial-domain embedding and then to DCT-domain steganography for 
which there are some additional challenges. 


3.1. Spatial domain. There are two difficulties to overcome in testing the theoretical result of Section 1. The first 
is the caveat that capacity is a square root law all other things being equal. Other literature on the benchmarking of 
steganalysis [5, 7] has shown that there are cover properties other than size ~ local variance, saturation, prior image 
processing operations — which significantly affect the detectability of payload,-and it is not possible to control or even 
determine them all. Therefore one cannot use sets of differently-sized covers from different sources to estimate how 
capacity depends on size: variations in the other properties may invalidate the results. Neither can one generate small 
cover images by downsampling large ones, because downsampled images have.a higher semantic density so, usually, higher 
local variance. The solution is to use a single set of large covers and repeatedly crop down to smaller images. In an 
attempt to preserve other image characteristics, the cropped region can be chosen so that the average local variance (here 
measured by average absolute difference between neighbouring pixels) is as close as possible to that of the whole image. 
The image libraries available to the PI are not large enough to partition them into disjoint sets for cropping to different 
sizes, So one may observe correlation between the content of the different-sized cropped images, but this is not expected 


- to cause significant effects in the experiments. 


The second difficulty is to define “capacity.” One can set a level of detection risk which the steganographer is prepared 
to accept, but (even apart from the fact that the level itself will be arbitrary) how to measure detectability? As discussed 
in [52] and [68], there are many different detection metrics found in the literature. For these experiments the following 
three metrics will be considered, one standard and one very recent: 


(1) The minimum sum of false positive and false negative errors for a binary classifier Pe = ὁ min(Pra + Pup) (for 
comparability with other measures, 1 — Pg is used); 

(2) Directly from the observed cover and stego distributions of steganalysis features, a recently-developed measure 
called Maximum Mean Discrepancy (MMD). Its key features are described in Section 3.4. 


In each case, higher values denote lower security. 

Before testing this hypothesis empirically, the PI returns briefly to the definition of capacity. It is not quite correct to 
speak of capacity as a bound on the size of payload because it is not payload itself which is detected by steganalysis. It 
is the changes induced by embedding which are detected, and capacity is more properly given by a bound on permissable 
changes; in simple embedding schemes where the changes are of fixed magnitude, it is the number of changes one should 
measure. This difference is important because of the existence of adaptive source codes [30], which can exploit freedom of 
choice of embedding locations to reduce the number of changes required. 

The first series of experiments was performed on never-compressed cover images. A set of 3000 images was downloaded 
from the NRCS website [63]: apparently scanned from film in full colour, these images vary slightly in size around 
approximately 2100 x 1500 pixels. The images were downsampled to a larger side of 1024 pixels, and reduced to grayscale: 
the same set of images has been used by a number of steganalysis researchers. Nine sets each of 3000 grayscale cover 
images were then created by repeated cropping, selecting the crop region best to match the local variance of the original, 
to sizes 100 x 75, 200 x 150, ..., 900 x 675. 

Random payload was embedded using simple LSB replacement (for payload smaller than maximum a random selection 
of embedding locations was used). Three different strategies were selected for choosing the payload size according to cover 
size: embedding a fixed-size payload in all cover sets, embedding payload proportional to the square root of the number of 
cover pixels, and embedding payload proportional to the number of cover pixels. For each option, three different constants 
of proportionality were tested. 

The method in [53] gives the currently-known best steganalysis of LSB replacement in never-compressed images, and it 
was applied to each set of covers and stego images. The accuracies of the resulting detector for payload, as measured by Pr 
and MMD, are displayed in Fig. 3.1, along with 90% confidence intervals obtained using a simple resampling bootstrap. 
These experiments are in line with the theoretical predictions: whichever detectability metric is used, fixed-length payload 
becomes harder to detect in larger covers, fixed-proportion payload becomes easier to detect, and payload proportional 
to square root of cover size is (approximately) of constant detectability. At least these results suggest that square root 
capacity is much more plausible than proportionate capacity. 
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FIGURE 3.1. Detectability (y-axis: 1 -- Pg and MMD on a log scale) as a function of cover size N (z- 
axis) and payload size. 90% bootstrapped confidence intervals are indicated. Left, fixed payload size. 
Middle, payload proportional to VN. Right, proportional to N. LSB replacement steganography in 
never-compressed cover images, detected by method of [47]. 
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FIGURE 3.2. Detectability as a function of cover size, cf. Fig. 3.1. LSB replacement steganography in 
previously JPEG-compressed digital camera images, detected by method of [48]. 


The same experiments were repeated with a set of 1600 images taken by a Minolta DiIMAGE ΑἹ camera in raw format 
at a resolution of 2000 x 1500, subsequently converted to grayscale and subject to JPEG compression (quality factor 80). 
The images were cropped to 16 different sizes between 100 x 75 and full size, again selecting the crop region to match the 
average local variance of the original. When cover images have been. previously compressed, different detectors for LSB 
replacement have better performance than that in [53], so the PI used the Triples detector of [48]. 

Charts analogous to those in Fig. 3.1, for the compressed cover images and Triples steganalysis, are displayed in Fig. 3.2 
and one can draw similar conclusions: secure payload is certainly not constant, nor proportional to cover size, but appears 
to be approximately proportional to the square root of the cover size. More visible in this second set of experiments are 
artefacts in the charts for very small cover sizes, but these are to be expected if the theoretical results are only asymptotic 
for large covers. 

Finally, the PI tested an alternative method of spatial-domain LSB embedding known as LSB matching, or +1 embed- 
ding. It does not have the structural flaws of LSB replacement, and seems much more difficult to detect. For the detector, 
the method known as the adjacency HCF COM found in [49] was used. This detector is still quite weak: payloads as 
smal] as those in the previous two experiments are undetectable, so the PI had to increase the payload sizes considerably. 
As a result, it was not possible to fit the payloads into very small covers (one cannot embed more than 1 bit per pixel 
using LSBs). The same 3000 never-compressed scanned images were used for the first experiment, cropped down to ten 
sizes between 360 x 270 and 900 x 675. The resulting charts are displayed in Fig. 3.3. 

Observe that the detector’s performance remains very low: Pg not much below 0.5 (which corresponds to a random 
detector) and MMD is near to zero (corresponding to identical distributions of cover and stego features) and, because one 
is digging in the detector noise, the bootstrap confidence intervals are wider. However, similar features are still apparent: 
falling detectability in larger covers when the payload is fixed and rising detectability when the payload is proportional to 
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FIGURE 3.3. Detectability as ἃ function of cover size, cf. Fig. 3.1. LSB matching steganography in 
never-compressed scanned images, detected by method of [49]. 
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FIGURE 3.4. Capacity (y-axes, determined by limit on Pg) as a function of cover size (z-axis), log-log 
scale, with best-fit. trend lines. Three different steganography/steganalysis methods displayed. Left, 
LSB replacement in never-compressed images detected by [53]; middle, LSB replacement in previously 
JPEG-compressed images detected by [48]; right, LSB matching detected by [49). 


cover size. When the payload is proportional to the square root of the cover size, the detection metrics are approrimately 
constant, although there is a suggestion that the detectability may be gradually decreasing. 

To investigate more precisely how capacity depends on cover size the PI performed additional experiments: fixing on 
just one detection metric a bound was set on the risk to the steganographer (a minimum value of Pg) and determined the 
largest payload for which the detection bound can be met. This was accomplished by embedding 100 different payload 
sizes in each of the cover sets, measuring Pg for each combination and using linear interpolation to estimate Pr for 
intermediate payloads. Denoting cover size (pixels) by N and capacity (payload bits) by M, one can plot M against N 
on a log-log scale: if there is a relationship of the form M « N* then the points should fall in a straight line with slope e. 

Fig. 3.4 displays the results for each of the three detectors and cover sets in the experiments, with two different 
thresholds for Pg (in the case of LSB matching, one must set a very high threshold for Pg because the detector is so 
weak). In each case, a straight line fit is determined by simple linear regression. When capacity is measured in this way, 
it does indeed appear to follow a relationship M « N*, with values of e very close to 0.5. Even the line corresponding to 
Pz = 0.45 with the LSB matching detector would have slope close to 0.5 if the data points from the smallest image sets 
were discounted. Unfortunately one cannot use the standard least-squares tests for whether e differs significantly from 
0.5, because the data points are not independent (they arise from images with overlapping content). 


3.2. Experimental Investigation: JPEG Steganography. The experiments of the previous section were repeated for 
steganography and steganalysis in JPEG images, to see whether the square root law still holds. An improved version of F5, 
the so-called no-shrinkage F5 (nsF5) [37] was used as an example of a leading steganographic method in the JPEG domain. 
The nsF5 has the same embedding operation but uses wet paper codes [34] to remove shrinkage. The syndrome-coding 
mechanism in nsF5 was disabled because it introduces non-linearity between the payload and the number of embedding 
changes [37]. 

Measuring the size of a JPEG image is not as simple as counting pixels. After lossy compression, many of the 
DCT coefficients become zero and do not convey content: these coefficients cannot be used for embedding. Therefore 
one should define the steganographic size as the total number of nonzero DCT coefficients (abbreviated nc). This is a 
generally-accepted measure, although some authors also discount DC coefficients. 
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FIGURE 3.5. Detectability as a function of cover size (nonzero DCT coefficients). No-shrinkage ΕΘ 
steganography with matrix embedding disabled, in JPEG covers, detected by method of [67]. 


The PI began with approximately 9200 never-compressed images of different sizes, and from them cropped 15 sets of 
cover images each with a specified number of nonzero coefficients, 2-104, 4-10*,...,30-10*, all under JPEG compression 
with quality factor 80. None of the images were double-compressed. As in the spatial-domain experiments, cropping was 
favored over scaling: the latter produces images with a higher number of nonzero DCT coefficients on higher frequencies, 
so statistics of DCT coefficients in scaled images vary substantially with cover size. Also paralleling the experiments in 
the previous section, the crop region was chosen to preserve some other characteristics of the cover. In the case of JPEG 
images, an attempt was made to preserve the proportion of nonzero DCT coefficients. 

In each set of covers, a random message was embedded using the nsF5 algorithm. As before, the strategies for choosing 
the payload were to embed a fixed size payload into all cover sets, to embed payload proportional to the square root of 
the number of nonzero coefficients, and to embed payload proportionally to the number of nonzero coefficients. 

The combination of Support Vector Machine (SVM) classifiers [16] with a Gaussian kernel and the 274-dimensional 
merged feature set [67] is the state of art general purpose steganalytic system for JPEG images. The PI measured 
detectability using SVMs trained specifically to each combination of cover and payload size: for each such combination, 
6000 images were selected at random from the available set of 9200, split into disjoint sets of 3500 for training and 2500 
for testing. In the training stage, the 3500 cover images and 3500 corresponding stego images were used; similarly in the 
testing stage, the 2500 cover images and 2500 corresponding stego images were all classified by the SVM. The training 
and testing of the SVM classifiers was repeated 100 times with different random selections of training and testing sets, 
and the overall 1 — Pg metric computed for the resulting binary classifiers. 

Additionally, the MMD between the “merged feature set” vectors in cover and stego images was computed. Again, 6000 
images were selected at random, this time partitioned into disjoint sets of 3000 covers and 3000 stego images (disjoint 
sets are necessary for good MMD estimation, see Section3.4). This was repeated 100 times with random allocations of 
cover and stego images: increasing the accuracy of the estimate, and also allowing the PI to estimate rough bootstrap 
confidence intervals. Prior to computing MMD, the vectors were normalized so that each cover feature had zero mean 
and unit variance: note that, although the MMD kernel 7 parameter (see Section 3.4) is fixed for all cover sizes, the 
normalization parameters are determined separately for each set. This proved necessary because great variability was 
observed in the raw feature distributions, as the cover size varied. 

The results of the experiment (Figure 3.5) confirm the theoretical predictions, and are similar to the results presented 
for the spatial domain. For fixed (respectively, linear) payload, by any metric the detectability increases (resp. decreases) 
with the cover size, and for payload proportional to the square root of nc the detectability is approximately constant, 
albeit with a barely-visible downwards trend. It is not known why the MMD measure shows this as a slightly stronger 
effect than Pr. 

Following the previous section, the next experiment was to find payload such that the probability of error Pg matches a 
certain level. The search for the payload was carried under the reasonable assumption that the detectability increases with 
the payload size. The Pg measure at each given payload was estimated by the accuracy of the classifier (again, a SVM 
with a Gaussian kernel employing “merged” features) targeted to a given combination of nc. and payload. The training 
and testing conditions were the same as in the previous experiment. Even though repeated training of the classifier is 
very time consuming, this approach was favoured because it provides good estimates of Pe. 
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FIGURE 3.6. Capacity (y-axes, determined by limit on Pg) as a function of cover size (z-axis), log-log 
scale, with best-fit trend lines. No-shrinkage F5 in JPEG covers. 


Figure 3.6 shows maximum payload M plotted against nc N in log-log scale for Pg = 0.1 and Pg = 0.25. Payloads 
were identified within 1% accuracy of the desired Pg level. In both cases, the graph shows a close accordance with a 
straight line and the slope of the line is close to 0.5. This shows that the capacity of JPEG images for nsF5 (without 
matrix embedding) grows with square root of the number of nonzero DCT coefficients. 


3.3. Discussion. The purpose of this work was to verify the square root law of secure steganographic payload. Using 
carefully-designed experiments, which as far as possible isolate the effect of cover size from other cover properties, the 
square root law was tested for a number of steganography schemes, using contemporary steganalysis detectors. Close 
adherence to the law was observed. 

It is not widely known that the secure capacity of a cover is proportional only to the square root of its size (where 
size should be measured by. available embedding locations), in the absence of perfect steganography. It seems to be of 
fundamental importance to the practice of steganography, and could be particularly vital for the design of steganographic 
file systems, where the user might expect to be given an indication of secure capacity. 

However, when interpreting the square root law one must take care not to ignore other important factors which con- 
tribute to capacity. In practice, properties of cover images such as saturation, local variance, and prior JPEG compression 
or image processing operations have been shown to have significant effects on detectability of payload [5]. One cannot 
simply conclude that, because one cover is twice as larger as another, it can carry ν 2 times the payload at an equivalent 
risk. The law applies other all things being equal and, as the difficulties constructing suitable experiments to test the law 
illustrate, rarely are cover images equal. 

It should also be emphasised that the law truly applies not to raw payload size but to the embedding changes caused. 
In some embedding schemes these quantities are not proportional. For example, using syndrome coding [30] and binary 
embedding operations it is possible to design embedding codes for which the number of embedding changes c and payload 
size M approaches asymptotically the bound c > Nh~!(M/N), where h is the binary entropy function. The consequence 
of an asymptotic limit ¢ = O(N). is then M = O(VNlogN). It would appear that, fundamentally, steganographic 
payload capacity is of order VN log N. 

One could argue that, because of the square root law, researchers should cease to report payloads measured in bits per 
pixel, bits per second, bits per nonzero coefficient, etc: the correct units should perhaps be bits per square root pixel and 
so on. However, such a change would still not allow comparability of different authors’ benchmarks, because of the other 
factors affecting detectability; unless different authors use covers from the same source, their results cannot be exactly 
comparable in any case. 


3.4. The MMD Measure. Maximum Mean Discrepancy (MMD) [42] is a recently-developed measure of difference 
between probability distributions. If X and Y are random variables with the same domain Δ΄ then their MMD is defined 
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as 
(3.1) max |E[f(X)] — E[f(¥)]], 

where the maximum is taken over all mappings ἢ : X +} R from a unit ball # in'a Reproducing Kernel Hilbert Space 
(RKHS). Although not a true metric, the MMD is symmetric, nonnegative, and zero only when X and Y have the same 
distribution. 

For technical reasons it is simpler to use the square of the MMD measure in (3.1) and in this report the PI always 
reports squared MMD values. Given n independent observations x = (z),...,2,) of X and a further n independent 
observations y = (y},...,Yn) of Y, the (squared) MMD may be estimated by 

] 
ai — Th 2, k(x, 23) + Κίψι, yj) — 2k(x3, y5) 
where k is a bounded universal kernel k : XY x X +} R that defines the dot product in the RKHS [82]. The variance of 
the estimator decreases as 1/,/n, almost independently of the dimension of the random variables [43], and can also be 


improved by bootstrapping. MMD has been used for comparing security of stego-schemes in [68]. 
In this report, MMD is measured w.r.t. a Gaussian kernel 


k(z,y) = εχρί-- ἱῖς — yl), 
with the width parameter 7 set to ἢ 2, where ἢ is the median of the L2-distances between (normalized) features in a 


pooled set of all cover images. This choice is justified in [68]. Note that, for direct comparison of MMD values obtained 
from experiments on different cover sets, the -y parameter should remain fixed. 
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4. FISHER INFORMATION DETERMINES CAPACITY OF €-SECURE STEGANOGRAPHY 


As explained in Section 1, most practical stegosystems for digital media work by applying a mutually independent 
embedding operation to each element of the cover. For such stegosystems, the PI has shown in Section 2 that the Fisher 
information w.r.t. the change rate is a perfect security descriptor equivalent to KL divergence between cover and stego 
images. Under the assumption that cover elements form a Markov chain, in this section a closed-form expression for 
the Fisher information is derived and it is also shown how it can be used for comparing stegosystems and optimizing 
their performance. In particular, using an analytic cover model fit to experimental data obtained from a large number of 
natural images, the PI proves that the +1 embedding operation is asymptotically optimal among all mutually independent 
embedding operations that modify cover elements by at most 1. 

The key concept in essentially all communication schemes is the channel capacity defined as the amount of information, 
or largest payload, that can be safely transmitted over the channel. As shown in Section 1, for imperfect stegosystems, 
the communication rate is not a good descriptor of the channel because it approaches zero with increasing n. The sender, 
however, still needs to know what level of risk she is exposing herself to when sending a message to the recipient. It is 
critical for the sender to know how much information she can send using her stegosystem in an n-element cover while 
keeping the KL divergence between cover and stego objects below some chosen εξ. The SRL informs the sender that the 
amount of information that she can hide scales as r,/n [27], with r constant. 

In this section, the proportionality constant r from the SRL is proposed as a more refined measure of steganographic 
capacity of imperfect stegosystems. By the form of the law, r, for which the PI coins the term the root rate, essentially 
expresses the capacity per square root of cover size. Under the assumption that covers form a Markov chain and embedding 
is realized by applying a sequence of independent embedding operations to individual cover elements, a closed form 
expression is derived for the root rate. The root rate depends on the Fisher information rate w.r.t. the the change rate, 
which is a perfect security descriptor equivalent to the KL divergence between distributions of cover and stego objects 
(Section 2). Expressing the Fisher information rate analytically as a quadratic form allows one to evaluate, compare, 
and optimize security of stegosystems. To this end, the PI derives an analytic cover model from a large database of 
natural images represented in the spatial domain and shows that the +1 embedding operation is asymptotically optimal 
among all mutually independent embedding operations that modify cover elements by at most 1. Finally, using the Fisher 
information rate, the PI compares security of several practical stegosystems, including LSB embedding and +1 embedding. 
The findings appear to be consistent with results previously obtained experimentally using steganalyzers and are in good 
agreement with the experimental study of Section 3. 

The notation and definitions carry over from Sections 1 and 2. 


4.1. Fisher information in steganography. The PI now introduces the concept of root rate as a measure of capacity 
of imperfect stegosystems. First, the relationship between steganographic capacity of stegosystems satisfying Assumptions 
1-3 and the Fisher information w.r.t..the parameter β 
ἀ ain) ; 
4.1 I,(0) = E —InQr’(y? 
(4.1) 2(0) l(a as |e.) | 


is explained. Then in Section 4.2, a closed form expression for the Fisher information is derived and written in terms of 
the expected relative payload a instead of parameter § as this form is more informative for the steganographer. 

Fisher information is a fundamental quantity that frequently appears in theoretical steganography and in general in 
signal detection and estimation. For example, the Cramer-Rao lower bound states that the reciprocal of Fisher information, 
1/I,(8), is the lower bound on the variance of unbiased estimators of 8 (quantitative steganalyzers). Fisher information also 
appears in the leading term of Taylor expansion of the KL divergence d,,(8) = Dx (ΡΟ Ὁ Ὁ = β21,(0),(21η 2)-Ο(β50, 
where 


Pp) (x?) 
a EXxNn β ] 

Zero KL divergence implies zero Fisher information. Although the opposite is not true in general, it holds for all stegosys- 
tems with MI embedding and arbitrary cover model. For such stegosystems, Fisher information J,,(0) represents a perfect. 
security descriptor equivalent to the KL divergence. Fisher information was also proposed for benchmarking steganalyz- 
ers [52]. 

The relationship between the Fisher information rate and steganographic capacity of stegosystems satisfying Assump- 
tions 1-3 was established in Section 2. It was essentially shown that such stegosystems are subject to the Square Root Law, 
which means that payloads that grow faster than /n, i.e., limp... 8(n)n/./n = oo, can be detected arbitrarily accurately, 
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whereas payloads that grow slower than \/n, i.e., 8(n)n/./n < Καὶ < οὐ, lead to e-secure stegosystems, d,(8) < e.2 This 
result means that the payload that can be securely transmitted over the steganographic channel scales as r,/n. Conse- 
quently, the sequence of embedding parameters §(n) must approach zero for e-secure systems and thus the communication 
rate tends to zero. Due to this fact, it makes sense to evaluate steganographic capacity in the limit of 8(n) --Ὁ 0. 


4.2. Root rate. The problem of steganalysis can be formulated as the following hypothesis testing problem 


Ho : B = ἢ 
(4.2) H, : β»0. 
It is shown show that for small (and mr β and “με n, the likelihood ratio test with test statistic 
ro - (n) (n) 
4.3 X) = —=1n (Qz, (ΧΊ ΡΟ (Χ 
(4.3) als . (X) = (Qp (X)/P(X)), 


is a mean-shifted Gauss-Gauss problem.’ This property, usually called the Local Asymptotic Normality (LAN) of the 
detector, allows us to quantify and correctly compare security of embedding algorithms operating on the same MC cover 
model] for small values of β. 

In this case, the detector performance can be completely described by the deflection coefficient d?, which parametrizes 
the ROC curve as it binds the probability of detection, Pp, as a function of the false alarm probability, Pra, 


Pp = Q(Q7!(Pra) — Vd?). 
Here, Q(x) = 1 — (x) and ®(z) is the cdf of a standard normal variable N(0,1). Large value of the deflection coefficient 
implies better detection or weaker steganography. 
First, the PI states the LAN property for the HMC model w.r.t. the embedding parameter @ and then extends this 
result with respect to the relative payload a. | 


Theorem 8. [LAN of the LLRT] Under Assumptions 1-3, the likelihood ratio (4.3) satisfies the local asymptotic normality 
(LAN), i.e., under both hypotheses and for values of 8 up to order β3 


(4.4) Jn(T” /n + 821,2) “+ N(0, 672) under Ho 
(4.5) Jn(TS” /n — 67/2) + N(0, 671) under Hy, 


where I is the Fisher information rate, I = limy—oo +7,,(0), and -*+ is the convergence in distribution. The detection 
performance is thus completely described by the deflection coefficient 


d? = (yar ey ς nf? I. 


Proof. Due to limited space, only a brief outline of the proof is provided. The Gaussianity of the test statistic follows from 
the Central Limit Theorem (CLT) due to the fact that the test statistic is close to being i.i.d. Formal proof of this uses 
exponential forgetting of the prediction filter [18, Lemma 9] and follows similar steps as the proof of the CLT for Markov 
chains [17]. The mean and variance of the likelihood ratio (4.3) is obtained by expanding (4.3) in Taylor series w.r.t. 8 
and realizing that the leading term is the quadratic term containing the Fisher information rate. 5 


The conclusion of the theorem is now rephrased in terms of the payload rather than the parameter 8. Matrix embedding 
(syndrome coding) employed by the stegosystem may introduce a non-linear relationship 8 = f(a) between both quantities. 
In general, the payload embedded at each cover element may depend on its state i € Δ΄. Thus, the expected value of the 
relative payload that can be embedded in each cover is α(β) = Σ᾿ εν miai(8), where a;(8) stands for the number of bits 
that can be embedded into state 1 € X and 7; is the stationary distribution of the MC. The value of 8 for which a is 
maximal] will be denoted as By ax 


Bmax = arg "δες α(β[. 


For example, for ternary +1 embedding βμαχ = 2/3 and aj(8asax) = log, 3, while for binary +1 embedding βμαχ - 1.2. 
and αι,ι(βμαΧ) = 1. The matrix C is the same for both embedding methods. The only formal difference is the range of the 


3Here, it is assumed that there exists a linear relationship between §(n) and the relative payload a(n) (e.g., the stegosystem does not 
employ matrix embedding). Indeed, application of matrix embedding does not invalidate our arguments as a(n) differs from 8(n) only by a 
multiplicative factor bounded by log π. 

4In hypothesis testing, the problem of testing N (410,07) vs. Νίμι, σ2) is called the mean-shifted Gauss-Gauss problem and its detection 
performance is completely described by the deflection coefficient d? = (19 — y1)*/o7 [46, Chapter 3]. 
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parameter 8. It is also worth noting that unless all a; are the same, the maximal payload will depend on the distribution 
of individual states πὶ. 


To simplify our arguments, it will be assumed that a linear relationship between 8 _ a exists (e.g., matrix embedding 
is not considered). Therefore, one can write 


(4.6) Bae fla) = SMAR, 


πόσας, 
where a Ε [0,αμαχ] and ayax = αἰβ μαχὴ denotes the average number of bits that can be embedded into cover element 
while embedding with β = βμαχ (maximum change rate). 


From (4.6), the deflection coefficient can be expressed in terms of the relative payload α by substituting 8 = f(a) from 
(4.6) into Qg 


2 
(4.7) d? = na? ( Paar | 1. 
ΔΜΑΧ 


In practice, the sender can control statistical detectability by bounding d? < ε for some fixed ε, obtaining thus an upper 
bound on the total number of bits (payload) an that can be safely embedded (this requires rearranging the terms in (4.7)) 


QMAX /E 
(4.8) an <5 [=n. 
Bmax VI 


In analogy to the communication rate, it is natural to define the root rate 


(4.9) r & MAX 
VIBMax 
as the quantity that measures steganographic security of imperfect stegosystems in bits per square root of cover size per 
square root of KL divergence. The root rate can be used for comparing stegosystems with a MC cover model. 
In the next theorem, proved in the appendix, the PI establishes the existence of the main component of the root rate, 
the Fisher information rate J, and expresses it in a closed form. | 


Theorem 9. [Fisher information rate] Let A = (a;;) define the MC cover model and B, defined by matrix C = (c;;), 
capture the embedding algorithm. Then, the normalized Fisher information I,,(0)/n approaches a finite limit I as n > oo. 
This limit can be written as I = c? Fc, where c is obtained by arranging C into a column vector of size N? with elements 
οι}. The matriz F of size N? x ΝΞ is defined only in terms of matriz A and does not depend on the embedding algorithm. 
The elements of matrix F are 


(4.10) feign) = 7) = YV (4,9, k) — U4, 97,4, 0), 
where by the Iverson notation [7 = ἰ] is one if 7 =1 and zero otherwise and 


Vi ik) = (Sp teat) (3 ant) 


ΖΕΔ zEx 


U(a, 7, k, ἢ) = = πὶ (ox - 0 St ) + 7, (αι — Akj — = ᾿ Α 
a; 


αι] 
Moreover, \I,(0)/n—I| < C/n for some constant C. This constant depends only on the elements of matriz A and not on 
the embedding algorithm. The quadratic form I(c) = c7Fc is semidefinite, in general. 


By inspecting the proof of the theorem, the matrix F can be seen as the Fisher information rate matrix w.r.t. the 
parameters {b;;|1 < i,7 < N}. It describes the natural sensitivity of the cover source to MI embedding. The quadratic 
form then combines these sensitivities with coefficients given by the specific embedding method and allows us to decompose 
the intrinsic detectability caused by the cover source from the detectability caused by the embedding algorithm. 


Corollary 10. For the special case when the MC degenerates to an i.i.d. cover source with distribution P = 7, the Fisher 
information rate simplifies to 
We Whe 
1 τῷ νὴ 4 Chey. 
Cij _ = 


i,j,kEX 7 


“The order of elements in C is immaterial as far as the same ordering is used for pairs (i, 7) and (k,/) in matrix F. 
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4.3. Maximizing the root rate. In the previous section, it was established that.the steganographic capacity of imperfect 
stegosystems should be measured as the root rate (4.9) defined as the payload per square root of the cover size and per 
square root of KL divergence. The most important component of the root rate is the stegosystem’s Fisher information 
rate, for which an analytic form was derived in Theorem 9. The steganographer is interested in designing stegosystems 
(finding C) with the highest possible root rate. This can be achieved by minimizing the Fisher information rate or by 
embedding symbols from a larger alphabet, i.e., increasing the ratio aagax/B8aax. In this section, two general strategies 
are described for maximizing the root rate that are applicable to practical stegosystems. In Section 4.6, conclusions are 
drawn from experiments when these strategies are applied to real cover sources formed by digital images. 

Before proceedings with further arguments, the PI points out that the highest root rate is obviously obtained when the 
Fisher information rate is zero, 1 = 0. This can happen for non-trivial embedding (C # 0) in certain sources because the 
Fisher information rate is a semidefinite quadratic μα. Such πὰ δ however, would be perfectly secure and thus 
by Assumption 3 are excluded from our consideration.°® 

The number of bits, ας, that can be embedded at each state i € *¥ is bounded by the entropy of the ith row of 
B=1+ SC, H(B,.). Thus, in the most general setting, one wishes to maximize the root rate 


2. 7H (Bie (Barax)) 1 
Bmax νΊ 


w.r.t. matrix C. The nonlinear objective function makes the analysis rather complicated and the result may depend 
on the distribution of individual states 7. Moreover, even if one knew the optimal solution, care needs to be taken in 
interpreting such results, because a practical algorithm allowing us to communicate the entropy of the additive noise may 
not be available. The PI is aware of only a few practical embedding algorithms that communicate the maximal amount 
of information (LSB embedding with binary symbols and +1 embedding with ternary symbols). In practice, stochastic 
modulation [32] can be used in some cases to embed information by adding noise with a specific pmf (matrix C), but the 
specific algorithms described in [32] are suboptimal. 

In the rest of this section, two different approaches are presented how to optimize the embedding algorithm under 
different settings that are practically realizable. 


4.4. Optimization by convex combination of known methods. One simple and practical approach to optimize the 
embedding method is obtained by combining existing stegosystems S$‘) and S$‘). Suppose the sender embeds a portion 
of the message into An elements, 0 < λ «1, using S“ and use the remaining (1 —)n elements to embed the rest of the 
message using S$‘). If both sender and the receiver select the elements pseudo-randomly based on a stego key, the impact 
on a single cover element follows a distribution obtained as a convex combination of the noise pmfs of both methods. 
Note that the methods are allowed to embed a different number of bits per cover element since the receiver knows which 
symbol to extract. from each part of the stego πων μ τ S®) represent the ith embedding method with matrix C, or 
its vector representation c), with ratio p) =a ,,/6\),y for i € {1,2}. The root rate r(A) of the method obtained by 
the above approach (convex ‘qutnebliing) with parameter \ can be written as 


.ς χρῶτα 
(Ac) + (1 aa λ)ο( 2 TEAC + (1 m )c(2)) 
Ap) + (1 — djp 


JIA ἃ — APTA) + 2A(1 — A) IO” 


where J") is the Fisher information rate of S“ and J?) = (CO) Fel), Here, the symmetry of F was used to write 
1(1.2) = (2.2). 


r(A) = 


(4.11) = 


4.5. Minimizing the Fisher information rate. In an alternative setup, one may deal with the problem of optimizing 
the shape of the additive noise pmf under the assumption that the number of bits, ας, embedded at each state i € X is 
constant. For example, one may wish to determine the optimal pmf that would allow communicating 1 bit per element 
(a; = 1, Vi € &) by changing each cover element by at most 1. In this problem, the ratio ayax/Bmax, as well as the 
cover model (matrix A), are fixed and known. The task is to minimize the Fisher information rate J. 

The PI now formulates the optimization problem by restricting the form of the matrix C = (c;;), or its vector repre- 
sentation c = (c;;) € RY "x1 to the following linear parametric form 


(4.12) c=Dv+e, 


5An example of such a stegosystem is LSB embedding in i.i.d. covers with m2; = m2,;+) for all i. 
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where D = (dj;) is a full-rank real matrix of size N? x k, e is a real column vector of size ΝΖ, and v = (v,..., vx)" is 
ἃ k-dimensional column vector. It will be assumed that v € V, where V is bounded by a set of linear inequalities’ and 
the constraint Σ. cj; = 0 for all i Ε {1,...,N}. In other words, the matrix C is decomposed into k real parameters 2,, 


2 Ε {1,...,k}. The following example shows one such representation for a stegosystem whose embedding changes are at 
most 1. 


Example 11. [Tridiagonal embedding] Set cj, = —1, συ, ι--1 = vier, and οι ΞΞ 1 — vj-1 for i € {2,...,N —1} (and 
suitably defined at the boundaries). This allows modeling +1 embedding, LSB embedding, and all possible MI embedding 
methods that modify every element by at most 1. By setting c;; = —1 for all ὁ, the sender constraints her choices to 
stegosystems that.embed the same payload into every state 1 € ¥ for all 8 > 0. This model has k = N — 2 parameters 
and the set V is formed by v; € (0, 1], 7 € {1,...,k}. 


The sender’s task is to minimize the Fisher information rate for embedding methods given by (4.12). The function 
I(v) = (Dv + e)? F(Dv + δ) can attain its minimum either at a point with a zero gradient® (a critical point) or on the 
boundary of V. The PI now derives a set of linear equations for the set of all possible critical points. This approach 
will be used in Section 4.6 to prove that ternary +1 embedding is asymptotically optimal within the class of tridiagonal 
embedding in spatial domain. 

For the chosen parametrization, the gradient w.r.t. every parameter v; can be expressed as 


δας; ἃ Τ “(ΩΤ 
Du; I(v) = Dn, (Dv + e)* F(Dv + e) = 2(D.;)* F(Dv + e), 


where D,; is the jth column of matrix D. Because every possible candidate vo for the optimal parameters must satisfy 
(0/0v;)I(v)|v=v9 = 0 for every 7 € {1,..., k}, all critical points are solutions of the following linear system 


(4.13) D’ FDv = —D’ Fe. 


If this system has a unique solution vo € VY, then vp corresponds to matrix C achieving the global minimum of the Fisher 
information rate, which corresponds to the best MI embedding method w.r.t. V and a given MC cover source. 


4.6. Experiments. In the previous section, two strategies for maximizing the root rate for practical stegosystems were 
outlined. This section presents specific results when these strategies are applied to stegosystems operating on 8-bit gray- 
scale images represented in the spatial domain. Although images are two dimensional objects with spatial dependencies in 
both directions, they are represented in a row-wise fashion as a first-order Markov Chain over Δ΄ = {0,...,255}. The MC 
model represents the first and simplest step of capturing pixel dependencies while still retaining the important advantage 
of being analytically tractable. Then, a parametric model is adopted for the transition probability matrix of this Markov 
cover source and it is shownto be a good fit for the empirical transition probability matrix A estimated from a large 
number of natural images. This analytic model is used to evaluate the root rate (4.9) of several stegosystems obtained by 
a convex combinations of known methods. Finally, it is shown that the optimal embedding algorithm that modifies cover 
elements by at most 1 is very close to +1 embedding. 

In principle, in practice one could calculate the Fisher information rate using equation (4.10) with an empirical matrix 
A estimated from a large number of images. However, this approach may give misleading results because (4.10) is quite 
sensitive to small perturbations of aj; with a small value (observe that J = +00 ifa;; = 0). This is not going to be an issue 
in practice since rare transitions between distant states are probable but content dependent, which makes them difficult 
to be utilized for steganalysis. Because small values of aj; can not be accurately estimated in practice, the matrix A is 
represented with the following parametric model 


δ. ἡ ἁλῶν 
(4.14) ai; > (jt—3]/7) 
where Z; = Bei ε΄ ἐπ} τ)" is the normalization constant. The parameter controls the shape of the distribution, 
whereas 7 controls its “width.” The model parameters were found in the logarithmic domain using the least square fit 
between (4.14) and its empirical estimate. To validate this model, the least square was carried out fit separately for 
three image databases: never compressed images taken by several digital cameras? (CAMRAW), digital scans'® (NRCS), 


TE.g., one must have B > 0. 

5Note that the semidefiniteness of F guarantees that the extremum must be a minimum. 

°Expanded version of CAMERA_RAW database from [41] with 4547 8-bit images. 

10Contains 2375 raw scans of negatives coming from the USDA Natural Resources Conservation Service (http://photogallery.nrcs.usda. 
gov). 
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CAMRAW - log ai; log @127,; 


255 
0 255 7 


FIGURE 4.1. Left: plot of the empirical matrix A estimated from CAMRAW database in log domain. 
Right: comparison of the 128th row of matrix A estimated from the same database with the analytic 
model (4.14). 


and decompressed JPEG images’’ (NRCS-JPEG). Figure 4.1 shows the comparison between the empirical matrix A 
estimated from the CAMRAW database and the corresponding fit. Although this model cannot capture some important 
macroscopic properties of natural images, such as pixel saturations, it remains analytically tractable and is valid for many 
natural images. — 

The left part of Figure 4.2 shows the root rate (4.11), τίλ), for a convex combination of LSB and +1 embedding, 
λε [0,1], for different image sources. The higher the root rate r(A), the better the stegosystem. The results are consistent 
with the thesis that +1 embedding is less detectable than LSB embedding. Similarly, the capacity of stegosystems with 
covers from NRCS (scans) is believed to be higher than the capacity of stegosystem with decompressed JPEGs or images 
from digital cameras. This fact is in agreement with the obtained result for all values of the convex combination of LSB 
and +1 embedding and it can be attributed it to the fact that scans contain a higher level of noise that masks embedding 
changes. In contradiction with expectations, decompressed JPEGs from NRCS-JPEG have a higher root rate than raw 
images from digital cameras (CAMRAW). This phenomenon is probably caused by the simplicity of the MC model, which 
fails to capture JPEG artifacts because they span across larger distances than neighboring pixels. 

The methodology described in Section 4.5 is now used to maximize the root rate with respect.to stegosystems that 
modify each cover element by at most 1. It is done for the cover model fit obtained from the NRCS database. Assuming 
the embedding operation is binary, it can embed one bit per cover element. Thus, it is sufficient to find the MI embedding 
that attains the minimum Fisher information rate. The PI used the parametrization from Example 11 and solved the 
system of equations (4.13). This system has only one solution v = (v,..., 254) € V=[0,1]7*4 and thus it represents MI 
embedding with the minimum Fisher information rate. This solution is shown in the right part of Figure 4.2 along with 
the representation of the +1 embedding operation. The optimal MI embedding differs from +1 embedding only at the 
boundary of the dynamic range. This is due to the finite number of states in the MC model. It was experimentally verified 
that the relative number of states with |v; — 0.5| > 6 tends to zero for a range of 6 > 0 as N — oo for fixed parameters 
of the analytic model.’* Thus, the boundary effect is negligible for large N. This suggests that the loss in capacity when 
using +1 embedding algorithm is negligible for large N or, in other words, +1 embedding is asymptotically optimal. 


4.7. Discussion. In sharp contrast with the well-established fact that the steganographic capacity of perfectly secure 
stegosystems increases linearly with the number of cover elements, n, the square root law states that steganographic 
capacity of a quite wide class of imperfect stegosystems is only proportional to ,/n. The communication rate of imperfect 
stegosystems is thus non-informative because it tends to zero with n. Instead, an appropriate measure of capacity is the 
constant of proportionality in front of νι, for which the term the root rate is coined whose unit is bit per square root 
of cover size per square root of KL divergence. The root rate is shown to be inversely proportional to the square root of 
the Fisher information rate of the stegosystem. Adopting a Markov model for the cover source, an analytic formula for 
the root rate can be derived with Fisher information rate expressible as a quadratic form defined by the cover transition 
probability matrix evaluated at a vector fully determined by the embedding operation. This analytic form is important 
as it enables comparing capacities of imperfect stegosystems as well as optimizing their embedding operation (maximize 


‘Images from NRCS database compressed with JPEG quality factor 70. 
12The same is likely to be true for all ὃ > 0. 
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Convex combin. of LSB and +1 Optimal tridiagonal embedding v 
r(A) 


0- LSB Η Τ΄ ὧς 1 ἶ 254 


FIGURE 4.2. Left: the root rate r(A) = αμαχιίβμαχν] ) of a convex combination of LSB and +1 
embedding for different image sources. Right: optimal parameters v = (v1,...,v254) of MI embedding 
(4.12) minimizing the Fisher information rate while modifying cover elements by at most 1. The difference 
between +1 embedding and optimal MI embedding is due to boundary effects that vanish as N — oo. 


the root rate). By fitting a parametric model through the empirical transition probability matrix for neighboring pixels of 
real images, the model is used to compute and compare the root rate of known steganographic schemes and their convex 
combinations. In agreement with results previously established experimentally using blind steganalyzers, the analysis 
indicates that ternary +1 embedding is more secure than LSB embedding and it is also optimal among all embedding 
methods that modify pixels by at most 1. Furthermore, by analyzing image databases of raw images from different sources, ’ 
it has been established that the root rate is larger for images with higher noise level as is to be expected. Among the 
surprising results of our effort, the PI points out the fact that the root rate for +1 embedding is only about twice larger 
than for LSB embedding, which contrasts with the fact that current best steganalyzers for LSB embedding are markedly 
more accurate than the best steganalyzers of +1 embedding. This hints at the existence of significantly more accurate 
detectors of +1 embedding that are yet to be found. 


4.8. Proofs. Proof of Theorem 9: Only the main idea of the proof is presented in this report, leaving the main bulk 
of technical details to the report [19]. The decomposition of the sequence J,,(0)/n to a quadratic form and its properties 
can be obtained directly from the definition of Fisher information 
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The derivatives of the log-likelihood are evaluated at B = I because Β(β) = 1+ 8C and β = 0. By using Qa(yj) = 
doar cxn P(zt)Qa(y?|z7t), the random variable οί", 1,7, k,1) does not depend on the embedding method. This is because 
the derivatives are evaluated at B = I] and thus only contain the elements of the cover source transition matrix A. The 
proof of the convergence of -1iEp (9(Y7", 2,9, k,l)] to fci.3),¢&,2) and its closed form is more involved and is presented in the 
report [19]. The semidefinitness of the quadratic form follows from semidefiniteness of the Fisher information matrix F. 
It is not positively definite because for an i.i.d. cover source all rows of matrix F coincide and are thus linearly dependent. 
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5. CONSTRUCTING PRACTICAL STEGOSYSTEMS USING SYNDROME-TRELLIS CODES 


The previous sections describe the achievements in fundamental understanding of steganographic capacity in imperfect 
stegosystems. This section describes a complete practical methodology for constructing minimal-distortion steganographic 
systems in practice. By this, the PI means how to embed a given payload while introducing the smallest possible distortion. 
As long as one can tie the distortion to statsitical detectability, this approach provides (near) optimal constructions. 

The idea is to proceed by steps from an easier problem to more complex ones. The simplest problem is to optimize 
the embedding for an additive binary embedding operation defined as follows: First, assign to each over element a scalar 
expressing the distortion of an embedding change done by replacing the cover element by its “other” value (here, the 
assumption is that the embedding operation is binary). The total distortion is assumed to be a sum of per-element 
distortions. Both the payload-limited sender (minimizing the total distortion while embedding a fixed payload) and the 
distortion-limited sender (maximizing the payload while introducing a fixed total distortion) are considered in this section. 
The non-binary case can be decomposed into several binary cases by replacing the individual bits in cover elements as 
described in one of the papers by the PI (25, 26]. The binary case can be resolved using a novel syndrome-coding scheme 
based on dual convolutional codes equipped with the Viterbi algorithm. This fast and very versatile solution achieves 
state-of-the-art results in steganographic applications while having linear time and space complexity w.r.t. the number 
of cover elements. The PI illustrates the power of the constructions for various relative payloads and different distortion 
profiles, including the wet paper channel. This framework substantially improves upon current coding schemes used in 
steganography, such as matrix embedding and wet paper codes. 

There exist two mainstream approaches to steganography in empirical covers, such as digital media objects: steganogra- 
phy designed to preserve a chosen cover model and steganography minimizing a heuristically-defined embedding distortion. 
The strong argument for the former strategy is that provable undetectability can be achieved w.r.t. a specific model. The 
disadvantage is that an adversary can usually rather easily identify statistical quantities that go beyond the chosen model 
that allow reliable detection of embedding changes. The latter strategy is more pragmatic -- it abandons modeling the 
cover source and instead tells the steganographer to embed payload while minimizing a distortion function. In doing so, 
it gives up any ambitions for perfect security. Although this may seem as a costly sacrifice, it is not, as empirical covers 
have been argued to be incognizable [6], which prevents model-preserving approaches from being perfectly secure as well. 

While the relationship between distortion and steganographic security is far from clear, embedding while minimizing a 
distortion function is an easier problem than embedding with a steganographic constraint (preserving the distribution of 
covers). It is also more flexible, allowing the results obtained from experiments with blind steganalyzers to drive the design 
of the distortion function. In fact, today’s least detectable steganographic schemes for digital images [55, 93, 71, 65] were 
designed using this principle. Moreover, when the distortion is defined as a norm between feature vectors extracted from 
cover and stego objects, minimizing distortion becomes tightly connected with model preservation insofar the features can 
be considered as a low-dimensional model of covers. This line of reasoning already appeared in (56, 65] and was further 
developed in [28]. 

With the exception of [23], steganographers work with additive distortion functions obtained as a sum of single-letter 
distortions. A well-known example is matrix embedding where the sender minimizes the total number of embedding 
changes. Near-optimal coding schemes for this problem appeared in [3], 21], together with other clever constructions 
and extensions [98, 94, 96, 22, 97, 95]. When the single-letter distortions vary across the cover elements, reflecting thus 
different costs of individual embedding changes, current coding methods are highly suboptimal (55; 71). 


5.1. Distortion function. For concreteness, and without loss of generality, x will be called an image and z; its ith pixel, 
even though other interpretations are certainly possible. For example, r; may represent an RGB triple in a color image, a 
quantized DCT coefficient in a JPEG file, etc. Let x = (21,...,2n) Ε X = {Z}" be an n-pixel cover image with the pixel 
dynamic range Z. For example, ZJ = {0,...,255} for 8-bit grayscale images. 

The sender communicates a message to the receiver by introducing modifications to the cover image and sending a 
stego image y = (y1,..-,¥n) € V = Ζ) x x--- xT, where 7; C T are such that x; € Z,. The embedding operation 
is called binary if |Z,| = 2, or ternary if |Zj| = 3 for every pixel i. For example, +1 embedding (sometimes called LSB 
matching) can be represented by 7; = {z; — 1,2;,2; +1} with appropriate modifications at the boundary of the dynamic 
range. 

The impact of embedding modifications will be measured using a distortion function D. The sender will strive to embed 
payload while minimizing D. This section is limited to an additive D in the form?!* 


e 


13The case of embedding with non-additive distortion functions is addressed in [23] by converting it to a sequence of embeddings with an 
additive distortion. 
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(5.1) D(x, y) = > pi(x, yi), 


ἐξε 1 


where pj : X x I; + [—K, K], 0 « Καὶ « οὐ, are bounded functions expressing the cost of replacing the cover pixel z; with 
y;. Note that ρὲ may arbitrarily depend on the entire cover image x, allowing thus the sender to incorporate inter-pixel 
dependencies [65]. The fact that the value of p;(x, νι) is independent of changes made at other pixels implies that the 
embedding changes do not interact. 

The boundedness of D(x, y) is not limiting the sender in practice since the case when a particular value y; is forbidden 
(a requirement often found in practical steganographic schemes [37]) can be resolved by excluding y; from Z;. In practice, 
the sets Z;, ὁ € {1,...,n}, may depend on cover pixels and thus may not be available to the receiver. To handle this case, 
the domain of p; is expanded to ¥ x T and defined p;(x, yi) = οὐ whenever y; ¢ 7;. 

The definition of the distortion function is intentionally kept rather’general. In particular, it is not required that 
pi(x,2;) < p;(x, yx) for all y; € 1; to allow for the case when it is actually beneficial to make an embedding change instead 
of leaving the pixel unchanged. An example of this situation appears in [23]. 


5.2. Problem formulation. This section contains a formal definition of the problem of embedding while minimizing a 
distortion function. The performance bounds are stated and some numerical quantities are defined that will be used to 
compare the coding methods w.r.t. each other and to the bounds. 

It is assumed that the sender obtains her payload in the form of a pseudo-random bit stream, such as by compressing 
or encrypting the original message. Moreover, the embedding algorithm associates every cover image x with a pair {), 7}, 
where Ν᾽ is the set of all stego images into which x can be modified and π is their probability distribution characterizing 
the sender’s actions, 7(y) = P(Y = y}|x). Since the choice of {Y, 7} depends on the cover image, all concepts derived 
from these quantities necessarily depend on x as well. Thinking of x as a constant parameter ὍΝ is fired in the very 
beginning, the dependency on x is not made explicit. For this reason, one can simply write ea D(x, y). 

If the receiver knew x, the sender could send up to 


(5.2) H(r) =— 5° x(y) log x(y) 


yey 


bits on average while introducing the average distortion 


(5.3) E,(D] = δ΄ x(y)D(y) 


yey 


by choosing the stego image according to 7. By the Gel’fand—Pinsker theorem [39], the knowledge of x does not give any 
fundamental advantage to the receiver and the same performance can be achieved as long as x is known to the sender. 
Indeed, none of the practical embedding algorithms introduced in this report requires the knowledge of x or D for reading 
the message. 

The task of embedding while minimizing distortion can assume two forms: 


e Payload-limited sender (PLS): embed a fized average payload of m bits while minimizing the average distortion, 


(5.4) minimize E,{[D] subject to H(z) =m 
ὁ Distortion-limited sender (DLS): maximize the average payload while introducing a fired average distortion 
D., 
(5.5) maximize H (7) subject to Ες [1] = δ. 


The problem of embedding a fixed-size message while minimizing the total distortion D (the PLS) is more commonly 
used in steganography when compared to the DLS. When the distortion function is content-driven, the sender may choose 
to maximize the payload with a constraint on the overall distortion. This DLS corresponds to a more intuitive use of 
steganography since images with different level of noise and texture can carry different amount of hidden payload and thus 
the distortion should be fixed instead of the payload (as long as the distortion corresponds to statistical detectability). 
The fact that the payload is driven by the image content is essentially a case of the batch-steganography paradigm [50]. 
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5.2.1. Performance bounds and comparison metrics. Both embedding problems described above bear relationship to the 
problem of source coding with a fidelity criterion as described by Shannon [76] and the problem of source coding with 
side information available at the transmitter, the so-called Gel’fand-Pinsker problem [30]. Problems (5.4) and (5.5) are 
dual to each other, meaning that the optimal distribution for the first problem is, for some value of D,, also optimal for 
the second one. Following the maximum entropy principle [13, Th. 12.1.1], the optimal solution has the form of a Gibbs 
distribution (see Appendix A in [31] for derivation): 


τ. exp(—AD(y)) (a) exp( —Api exp(—Api(yi)) 4 
(5.6) my) = ΤΗΝ ἐν Ζ,(λ) 2 [Tw 


where the parameter  € [0,00) is obtained from the corresponding constraints (5.4) or (5.5) by solving an algebraic 
equation;'* Z(A) = Dey exp(—AD(y)), Zi(A) = Σὸν ες, exp(—Api(ys)) are the corresponding partition functions. Step 
(a) follows from the additivity of D, which also leads to mutual independence of individual stego pixels y; given x. 

By changing each pixel i with probability 7; (5.6) one can simulate embedding with optimal 7. This is important 
for steganography developers who can test the security of a scheme that uses the pair {V,7} using blind steganalysis 
without having to implement a practical embedding algorithm. The simulator of optimal embedding can also be used 
to assess the increase in statistical detectability of a practical (suboptimal) algorithm w.r.t. to the optimal one. This 
separation principle simplifies the search for better distortion measures since only the most promising approaches can be 
implemented. The simulators are used to benchmark different coding algorithms by comparing the security of practical 
schemes using blind steganalysis. 

An established way of evaluating coding algorithms in steganography is to compare the embedding efficiency e(a) = 
απ Ε, [0] (in bits per unit distortion) for a fixed expected relative payload a = m/n with the upper bound derived 
from (5.6). When the number of changes is minimized, e is the average number of bits hidden per embedding change. For 
general functions p;, the interpretation of this metric becomes less clear. A different and more easily interpretable metric 


is to compare the payload, m, of an embedding algorithm w.r.t. the payload, myax, of the optimal DLS for a fixed D,, 

m m 

(5.7) UD.) = τς 
MMAX 


which will be called the coding loss. 


’ 


5.2.2. Binary embedding operation. For binary embedding operations, it is enough to consider a slightly narrower class of 
distortion functions without experiencing any loss of generality. The binary case is important as the embedding method 
described here can be extended to non-binary operations [20]. 

For binary embedding with 7; = {z;,%,}, αὶ 4 Zi, let p™" = min{p;(x, z,), p;(x, Zj)} and 0; = |p;(x, z;)—p;(x, Z,)| > 0. 
Eq. (5.1) can now be rewritten as: 


(5.8) D(x, y) = δ᾽ prin + 5° οι. lon < pi(x,m))- 


ἐξε 1 ἦξε 1 
Because the first sum does not depend on y, when minimizing D over y it is enough to consider only the second term. It 
now becomes clear that embedding in cover x while minimizing (5.8) is equivalent to embedding in cover z 


ΝΜ ἐς when p™™" = p;(x, x;) 


5.9 , 
ar τι when pi" = p,(x,Z;). 


while minimizing 


(5.10) D(z, y) = 5 Wt. vi) = £ se [ys τέ zi]; 


qez:] *=1 


with non-negative costs f;(z,2;) = 0 < p;(z,Z;) = ρὲ for all i (when the cover pixel z; is changed to 2;, the distortion D 
always increases). Thus, from now on binary embedding operations will always consider distortion functions of the form: 


(5.11) D(x,y) = Che. (ys γέ zi], 


i=] 


with 9; > 0. 


144 simple binary search will do the job because both H(m) and Ex [D] are monotone w.r.t. Δ. 
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FIGURE 5.1. Lower bound on the average per-pixel distortion, E,|D)/n, as a function of relative payload 
a for different distortion profiles. 


For example, F5 [90] uses the distortion function (5.11) with 9; = 1 (the number of embedding changes), while nsF5 [37] 
employs wet paper codes, where 9; € {1,00}. In some embedding algorithms [33, 55, 71], where the cover is preprocessed 
and quantized before embedding, p; is proportional to the quantization error at pixel z;. | 

Additionally, for binary embedding operations one introduces the so-called distortion profile 0 if 0; = o(t/n) for all i, 
where ρ is a non-decreasing!® function @ : [0,1] --Ὁ {0, K]. The following distortion profiles are of interest in steganography 
(this is not an exhaustive list}: the constant profile, o(2) = 1, when all pixels have the same impact on detectability when 
changed; the linear profile, o(x) = 2x, when the distortion is related to a quantization error uniformly distributed on 
[-- 72,0 (2] for some quantization step Q > 0; and the square profile, o(x) = 3x”, which can be encountered when the 
distortion is related to a quantization error that is not uniformly distributed. 

The profile ρ is usually normalized so that E,[D)/n = Στ περι πὶ = 0.5 when embedding a full payload m = n. With 
this convention, Figure 5.1 displays the lower bounds on the average per-pixel distortion for three distortion profiles. 

In practice, some cover pixels may require Z; = {z;} and thus 9; = 00 (the so-called wet pizels (33, 35, 37]) to prevent 
the embedding algorithm from modifying them. Since such pixels are essentially constant, in this case one should measure 
the relative payload a with respect to the set of dry pirels {x;|0; < 00}, i.e., a = m/|{xi]0; < co}|. The overall channel is 
called the wet paper channel and it is characterized by the profile 9 of dry pixels and relative wetness τ = |{x;|0; = 00}|/n. 
The wet paper channel is often required when working with images in the JPEG domain [37]. 


5.3. Syndrome coding. The PLS and the DLS can be realized in practice using a general methodology called syndrome 
coding. This section briefly reviews this approach and its history while Section 5.4 explains the main contribution -- the 
syndrome-trellis codes. 

Let us first assume a binary version of both embedding problems. Let P : Z; -- {0,1} be a parity function shared 
between the sender and the receiver satisfying Τί.) 4 P(y;) such as P(x) = z mod 2. The sender and the receiver need to 
implement the embedding and extraction mappings defined as Emb: ¥ x {0,1}" > Y and Ext: Y > {0,1}” satisfying 


Ext(Emb(x,m))=m Yxé4,Vme {0,1}”, 


respectively. In particular, the knowledge of the distortion function D at the receiver is not assumed and thus the 
embedding scheme can be seen as being universal in this sense. A common information-theoretic strategy for solving the 
PLS problem is known as binning, which is here implemented using cosets of a linear code. Such a construction, better 
known as syndrome coding, is capacity achieving for the PLS problem if random linear codes are used. 


Spy reindexing the pixels, one can indeed assume that 0) < 92 <--*:S On <K. 
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In syndrome coding, the embedding and extraction mappings are realized using a binary linear code C of length n and 
dimension n — m: 


12 Emb(x, = D(x, y), 
(5.12) mb(x,m) = arg a. (x, y) 
(5.13) Ext(y) = HP(y), 


where P(y) = (P(y),---; P(yn)), HE {0,1}”*” is a parity-check matrix of the code C, C(m) = {z € {0,1}"|Hz = m} is 
the coset corresponding to syndrome m, and all operations are in binary arithmetic. 

Unfortunately, random linear codes are not practical due to the exponential complexity of the optimal binary coset 
quantizer (5.12), which is the most challenging part of the problem. STCs form a rich class of codes for which the quantizer 
can be solved optimally with linear time and space complexity w.r.t. n. 

Since the DLS is a dual problem to the PLS, it can be solved by (5.12) and (5.13) once an appropriate message size m 
is known. This can be obtained in practice by m = myax(1—l’), where muax = ΗΠ (πλ) is the maximal average payload 
obtained from the optimal distribution (5.6) achieving average distortion 19, and I’ is an experimentally-obtained coding 
loss one expects the algorithm will achieve. 


5.3.1. Prior Art. The problem of minimizing the embedding impact in steganography, introduced above as the PLS 
problem, has been already conceptually described by Crandall [15] in his essay posted on the steganography mailing list 
in 1998. He suggested that whenever the encoder embeds at most one bit per pixel, it should make use of the embedding 
impact defined for every pixel and minimize its total sum: 


“Conceptually, the encoder examines an area of the image and weights each of the options that allow it 
to embed the desired bits in that area. It scores each option for how conspicuous it is and chooses the 
option with the best score.” 


Later, Bierbrauer (3, 4] studied a special case of this problem and described a connection between codes (not necessarily 
linear) and the problem of minimizing the number of changed pixels (the constant profile). This connection, which 
has become known as matrix embedding (encoding), was made famous among steganographers by Westfeld [90] who 
incorporated it in his F5 algorithm. A binary Hamming code was used to implement the syndrome-coding scheme for the 
constant profile. Later on, different authors suggested other linear codes, such as Golay [83], BCH [75], random codes of 
small dimension [38], and non-linear codes based on the idea of a blockwise direct sum [4]. Current state-of-the-art methods 
use codes based on Low Density Generator Matrices (LDGMs) [31] in cornbination with the ZZW construction [95]. The 
embedding efficiency of these codes stays rather close to the bound for arbitrarily small relative payloads [28]. 

The versatile syndrome-coding approach can also be used to communicate via the wet paper channel using the so-called 
wet paper codes [33]. Wet paper codes minimizing the number of changed dry pixels were described in [34, 75, 97, 22). 

Even though other distortion profiles, such as the linear profile, are of great interest to steganography, no general 
solution with performance close to the bound is currently known. The authors of [55] approached the PLS problem by 
minimizing the distortion on a block-by-block basis utilizing a Hamming code and a suboptimal quantizer implemented 
using a brute-force search that allows up to three embedding changes. Such an approach, however, provides highly 
suboptimal performance far from the theoretical bound (see Figure 5.8). A similar approach based on BCH codes and a 
brute-force quantizer was described in [71] achieving a slightly better performance than Hamming codes. Neither Hamming 
or BCH codes can be used to deal with the wet paper channel without significant performance loss. To the best of PI’s, 
no solution is known that could be used to solve the PLS problem with arbitrary distortion profile containing wet pixels. 


5.4. Syndrome-Trellis Codes. In this section, the focus is on solving the binary PLS problem with distortion func- 
tion (5.10). _The resulting codes are called the syndrome-trellis codes. These codes will serve as a building block for 
non-binary PLS and DLS problems as described in [20]. 

The construction behind STCs is not new from an information-theoretic perspective, since the STCs are convolutional 
codes represented in a dual domain. However, STCs are very interesting for practical steganography since they allow 
solving both embedding problems with a very small coding loss over a wide range of distortion profiles even with wet 
pixels. The same code can be used with all profiles making the embedding algorithm practically universal. STCs offer 
general] and state-of-the-art solution for both embedding problems in steganography. Here, the PI gives the description 
of the codes along with their graphical representation, the syndrome trellis. Such construction is prepared for the Viterbi 
algorithm, which is optimal for solving (5.12). Important practical guidelines for optimizing the codes and using them 
for the wet paper channel are also covered. Finally, the PI studies the performance of these codes by extensive numerical 
simulations using different distortion profiles including the wet paper channel. 
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FIGURE 5.2. Example of a parity-check matrix H formed from the submatrix H (h = 2,w = 2) and 
its corresponding syndrome trellis. The last h — 1 submatrices in H are cropped to achieve the desired 
relative payload a. The syndrome trellis consists of repeating blocks of w + 1 columns, where “po” and 
“p;”, ὁ > 0, denote the starting and pruning columns, respectively. The column labeled | € {1,2,...} 
corresponds to the /th column in the parity-check matrix H. 


The main goal is to develop efficient syndrome-coding schemes for an arbitrary relative payload a with the main focus 
on small relative payloads (think of a < 1/2 for example): In steganography, the relative payload must decrease with 
increasing size of the cover object in order to maintain the same level of security, which is a consequence of the square root 
law [27]. Moreover, recent results from steganalysis in both spatial [64] and DCT domains [57] suggest that the secure 
payload for digital image steganography is always far below 1/2. Another reason for targeting smaller payloads is the fact 
that as a — 1, all binary embedding algorithms tend to introduce changes with probability 1/2, no matter how optimal 
they are. Denoting with R = (n —m)/n the rate of the linear code C, then a -- 0 translates to R = 1 — ἃ + 1, which is 
characteristic for applications of syndrome coding in steganography. 


5.4.1. From convolutional codes to syndrome-trellis codes. Since Shannon [76) introduced the problem of source coding 
with a fidelity criterion in 1959, convolutional codes were probably the first “practical” codes used for this problem [86]. 
This is because the gap between the bound on the expected per-pixel distortion and the distortion obtained using the 
optimal encoding algorithm (the Viterbi algorithm) decreases exponentially with the constraint length of the code [86, 44]. 
The complexity of the Viterbi algorithm is linear in the block length of the code, but exponential in its constraint length 
(the number of trellis states grows exponentially in the constraint length). 

When adapted to the PLS problem, convolutional codes can be used for syndrome coding since the best stego image in 
(5.12) can be found using the Viterbi algorithm. This makes convolutional codes (of small constraint length) suitable for 
steganography because the entire cover object can be used and the speed can be traded for performance by adjusting the 
constraint length. Note that the receiver does not need to know D since only the Viterbi algorithm requires this knowledge. 
By increasing the constraint length, one can achieve the average per-pixel distortion that is arbitrarily close to the bounds 
and thus make the coding loss (5.7) approach zero. Convolutional codes are often represented with shift-registers (see 
Chapter 48 in [59]) that generate the codeword from a set of information bits. In channel coding, codes of rates R = 1/k 
for k = 2,3,... are usually considered for their simple implementation. 

Convolutional codes in standard trellis representation are commonly used in problems that are dual to the PLS problem, 
such as the distributed source coding. The main drawback of convolutional codes, when implemented using shift-registers, 
comes from our requirement of small relative payloads (code rates close to one) which is specific to steganography. A 
convolutional code of rate R = (k — 1)/k requires k — 1 shift registers in order to implement a scheme for a = 16. 
Here, unfortunately, the complexity of the Viterbi algorithm in this construction grows exponentially with k. Instead of 
using puncturing (see Chapter 48 in [59]), which is often used to construct high-rate convolutional codes, the PI prefers 
to represent the convolutional code in the dual domain using its parity-check matrix. In fact, Sidorenko and Zyablov [78] 
showed that optimal decoding of convolutional codes (our binary quantizer) with rates R = (k — 1)/k can be carried out 
in the dual domain on the syndrome trellis with a much lower complexity and without any loss of performance. This 
approach is more efficient as ἃ —+ 0 and thus is choosen for the construction. 

In the dual domain, a code of length n is represented by a parity-check matrix instead of a generator matrix as is more 
common for convolutional codes. Working directly in the dual domain allows the Viterbi algorithm to exactly implement 
the coset quantizer required for the embedding function (5.12). The message can be extracted in a straightforward manner 
by the recipient using the shared parity-check matrix. 


5.5. Description of syndrome-trellis codes. Although syndrome-trellis codes form a class of convolutional codes and 
thus can be described using a classical approach with shift-registers, it is advantageous to stay in the dual domain and 
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Forward part of the Viterbi algorithm Backward part of the Viterbi alg. 


1 wght[0] = 0 1 embedding_cost = wght[0] - 

2 wght[i,...,27h-1] = infinity 2 state = 0, indx--, indm-- 

3 indx = indm = 1 3 for i = num of blocks,...,1 (step -1) { 
4 for i= 1,...,num of blocks (submatrices in H) { 4 for j =w,...,1 (step -1) { 

s for 7 e1,...,¥ { // for each colum 5 ylindx] = path[indx] [state] 

6 for k = 0,...,27h-1 { // for each state 6 state = state XOR (y[indx]*H_hat[j]) 
7 wO = wght[k] + x[indx] *rho[indx] 7 indx--’ 

8 wi = wght[k XOR H_hat[j]] + (1-x[indx])*rho[indx] 8 } 

9 path[indx] ΓΚ] = wi < W071: 0 // C notation 9 state = 2*state + message [indm] 

10 nevwght[k] = min(wO, wi) 10 indm-- 

1 } 11 +} 


12 indx++ 

13 wght = newwght ———_— ee 

14} INPUT: x, message, H_hat 

ι “27 ae states x = (x[1),...,xfn]) cover object 

ed ἐπι να 0,...,2 ἔν "᾿ : message = (message[i],...,message[m]) 
17 wght(j] = wght(2*j + message[indm]] H_hat[j] = j th column in int notation 
1s weht[2~(bh-1),...,27h-1] = infinity 5 

19 indm++ OUTPUT: y, embedding cost 

20 } y = (y[1],...,y[nJ]) stego object 


FIGURE 5.3. Pseudocode of the Viterbi algorithm modified for the syndrome trellis. 


describe the code directly by its parity-check matrix. The parity-check matrix H € {0,1}"*" of a binary syndrome-trellis 
code of length n and codimension m is obtained by placing a small submatrix H of size h x w along the main diagonal 
as in Figure 5.2. The submatrices H are placed next to each other and shifted down by one row leading to a sparse and 
banded H. The height ἢ of the submatrix (called the constraint height) is a design parameter that affects the algorithm 
speed and efficiency (typically, 6 < h < 15). The width of H is dictated by the desired ratio of m/n, which coincides 
with the relative payload a = m/n when no wet pixels are present. If m/n equals to 1/k for some k € N, select ὦ = k. 
For general ratios, find k such that 1/(k +1) < m/n < 1/k. The matrix H will contain a mix of submatrices of width k 
and k +1 so that the final matrix H is of size m x n. In this way, one can create a parity-check matrix for an arbitrary 
message and code size. The submatrix H acts as an input parameter shared between the sender and the receiver and its 
choice is discussed in more detail in Section 5.5.2. For the sake of simplicity, in the following description it is assumed 
that m/n = 1/w and thus the matrix H is of the size ὁ x (b- w), where ὁ is the number of copies of Η in H. 

Similar to convolutional codes and their trellis representation, every codeword of an STC C = {z € {0,1}"|Hz = 0} 
can be represented as a unique path through a graph called the syndrome trellis. Moreover, the syndrome trellis is 
parametrized by m and thus can represent members of arbitrary coset C(m) = {z € {0,1}"/Hz = χω). An example of 
the syndrome trellis is shown in Figure 5.2. More formally, the syndrome trellis is a graph consisting of ὁ blocks, each 
containing 2"(w - 1) nodes organized in a grid of w +1 columns and 2" rows. The nodes between two adjacent columns 
form a bipartite graph, i.e., all edges only connect nodes from two adjacent columns. Each block of the trellis represents 
one submatrix H used to obtain the parity-check matrix H. The nodes in every column are called states. 

Each 2 Ε {0,1} satisfying Hz = m is represented as a path through the syndrome trellis which represents the process 
of calculating the syndrome as a linear combination of the columns of H with weights given by z. Each path starts in the 
leftmost all-zero state in the trellis and extends to the right. The path shows the step-by-step calculation of the (partial) 
syndrome using more and more bits of z. For example, the first two edges in Figure 5.2, that connect the state 00 from 
column po with states 11 and 00 in the next column, correspond to adding (P(y:;) = 1) or not adding (P(y,) = 0) the 
first column of H to the syndrome, respectively.1® At the end of the first block, all paths for which the first bit of the 
partial syndrome does not match m, are terminated. This way, one obtains a new column of the trellis, which will serve 
as the starting column of the next block. This column. merely illustrates the transition of the trellis from representing 
the partial syndrome (s;,...,5p) to (S2,...,Sh41). This operation is repeated at each block transition in the matrix H 
and guarantees that 2” states are sufficient to represent the calculation of the partial syndrome throughout the whole 
syndrome trellis. : 


16The state corresponds to the partial syndrome. 
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FIGURE 5.4. Embedding efficiency of 300 random syndrome-trellis codes satisfying the design rules for 
relative payload a = 1/2 and constraint height h = 10. All codes were evaluated by the Viterbi algorithm 
with a random cover object of n = 105 pixels and a random message on the constant, linear, and square 
profiles. Codes are shown in the order determined by their embedding efficiency evaluated on the constant 
profile. This experiment suggests that codes good for the constant profile are good for other profiles. Codes 
designed for different relative payloads have a similar behavior. 


To find the closest stego object, weights are assigned to all trellis edges. The weights of the edges entering the column 
with label /, 1 € {1,...,n}, in the syndrome trellis depend on the /th bit representation of the original cover object x, 
P(x). If Ῥίαι) = 0, then the horizontal edges (corresponding to not adding the [th column of ΕΠ) have a weight of 0 
and the edges corresponding to adding the Ith column of H have a weight of ρι. If P(z;) = 1, the roles of the edges are 
reversed. Finally, all edges connecting the individual blocks of the trellis have zero weight. 

The embedding problem (5.12) for binary embedding can now be optimally solved by the Viterbi algorithm with time 
and space complexity O(2"n). This algorithm consists of two parts, the forward and the backward part. The forward part 
of the algorithm consists of n + ὃ steps. Upon finishing the ith step, one knows the shortest path between the leftmost 
all-zero state and every state in the ith column of the trellis. Thus in the final, n + bth step, one discovers the shortest 
path through the entire trellis. During the backward part, the shortest path is traced back and the parities of the closest 
stego object P(y) are recovered from the edge labels. The Viterbi algorithm modified for the syndrome trellis is described 
in Figure 5.3 using a pseudocode. 


5.5.1. Implementation details. The construction of STCs is not constrained to having to repeat the same submatrix Η 
along the diagonal. Any parity-check matrix H containing at most h nonzero entries along the main diagonal will have 
an efficient representation by its syndrome trellis and the Viterbi algorithm will have the same complexity O(2'n). In 
practice, the trellis is built on the fly because only the structure of the submatrix HI is needed (see the pseudocode in 
Figure 5.3). As can be seen from the last two columns of the trellis in Figure 5.2, the connectivity between trellis columns 
is highly regular which can be used to speed up the implementation by “vectorizing” the calculations. 

In the forward part of the algorithm, one needs to store one bit (the label of the incoming edge) to be able to reconstruct 
the path in the backward run. This space complexity is linear and should not cause any difficulty, since for h = 10, n = 10°, 
the total of 2)° - 10°/8 bytes (= 122MB) of space is required. If less space is available, one can always run the algorithm 
on smaller blocks, say n = 10*, without any noticeable performance drop. If one is only interested in the total distortion 
D(y). and not the stego object itself, this information does not need to be stored at all and only the forward run of the 
Viterbi algorithm is required. 


5.5.2. Design of good syndrome-trellis codes. A natural question regarding practical applications of syndrome-trellis codes 
is how to optimize the structure of H for fixed parameters h and w and a given profile. If H depended on the distortion 
profile, the profile would have to be somehow communicated to the receiver. Fortunately, this is not the case and a 
submatrix H optimized for one profile seems to be good for other profiles as well. In this section, the PI studies these 
issues experimentally and describes a practical algorithm for obtaining good submatrices. 

Let us suppose that the goal is to design a submatrix Hi of size h x w for a given constraint height A and relative 
payload a = 1/w. The PI was unable to find a better algorithm than an exhaustive search guided by some simple design 
rules. First, H should not have identical columns because the syndrome trellis would contain two or more different paths 
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FIGURE 5.5. Average number of wet pixels out of n = 10° that need to. be changed to find a solution to 
(5.12) using STCs with h = 11. 


with exactly the same weight, which would lead to an overall decrease in performance. By running an exhaustive search 
over small matrices, the PI observed that the best submatrices H had ones in the first and last rows. For example, when 
ἢ = 7 and ὦ = 4, more than 97% of the best 1000 codes obtained from the exhaustive search satisfied this rule. Thus, 
the search for good matrices was limited to those that did not contain identical columns and with all bits in the first 
and last rows set to 1 (the remaining bits were assigned at random). In practice, one can randomly generate 10 — 1000 
submatrices satisfying these rules and estimate their performance (embedding efficiency) experimentally by running the 
Viterbi algorithm with random covers and messages. For a reliable estimate, cover objects of size at least n = 10° are 
required. 

To investigate the stability of the design w.r.t. to the profile, the following experiment was conducted. The PI fixed 
h = 10 and w = 2, which correspond to a code with a = 1/2. The code design procedure was simulated by randomly 
generating 300 submatrices H,,..., Hgo9 satisfying the above design rules. The goodness of the code was evaluated using 
the embedding efficiency (e = m/D(x,y)) by running the Viterbi algorithm on a random cover object (of size n = 10°) 
and with a random message. This was repeated independently for all three profiles from Section 5.2.2. Figure 5.4 shows 
the embedding efficiency after ordering all 300 codes by their performance on the constant profile. Because the codes with 
a high embedding efficiency on the constant profile exhibit high efficiency for the other profiles, the code design appears 
stable w.r.t. the profile and these matrices can be used with other profiles in practice. All further results are generated 
by using these matrices. 


5.0.3. Wet paper channel. In this section, the PI investigates how STCs can be used for the wet paper channel described 
by relative wetness τ = [{{{0: = 00}|/n with a given distortion profile of dry pixels. Although the STCs can be directly 
applied to this problem, the probability of not being able to embed a message without changing any wet pixel may be 
positive and depends on the number of wet pixels, the payload, and the code. The goal is to make this probability very 
small or to make sure that the number of wet pixels that must be changed is small (e.g., one or two). Two different 
approaches are now described to address this problem. 

Let us assume that the wet channel is iid with probability of a pixel being wet 0 < 7 < 1. This assumption is plausible 
because the cover pixels can be permuted using a stego key before embedding. For the wet paper channel, the relative 
payload is defined w.r.t. the dry pixels as a = m/|{i|p; < οο}}. When designing the code for the wet paper channel 
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FIGURE 5.6. Effect of relative wetness τ of the wet paper channel with a constant profile on the embedding 
efficiency of STCs. The distortion was calculated w.r.t. the changed dry pixels only and a = m/(n—Tn). 
Each point was obtained by quantizing a random vector of n = 10° pixels. 
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FIGURE 5.7. Comparison of the coding loss of STCs as a function of the profile exponent d for different 
payloads and constraint heights of STCs. Each point was obtained by quantizing a random vector of 
n = 10° pixels. 


with n-pixel covers, relative wetness τ, and desired relative payload a, the parity-check matrix H has to be of the size 
[( -- r)an] x n. 

The random permutation makes the Viterbi algorithm less likely to fail to embed a message without having to change 
some wet pixels. The probability of failure, py, decreases with decreasing a and 7 and it also depends on the constraint 
height h. From practical experiments with n = 10° cover pixels, r = 0.8, and h = 10, the PI estimated from 1000 
independent runs py = 0.24 for a = 1/2, py = 0.009 for a = 1/4, and p, = 0 for a = 1/10. In practice, the message 
size m can be used as a seed for the pseudo-random number generator. If the embedding process fails, embedding m — 1 
bits leads to a different permutation while embedding roughly the same amount of message. In k trials, the probability of 
having to modify a wet pixel is at most p*, which can be made arbitrarily small. 

Alternatively, the sender may allow a small number of wet pixels to be modified, say one or two, without affecting the 
statistical detectability in any significant manner. Making use of this fact, one can set the distortion of all wet cover pixels 
to 6 = Ο, Ο > di coo θὲ and 6; = ρὶ fort dry. The weight c of the best path through the syndrome trellis obtained 
by the Viterbi algorithm with distortion 6; can be written in the form c = n.C +c’, where τς is the smallest number of 
wet cover pixels that had to be changed and c’ is the smallest weight of the path over the pixels that are allowed to be 
changed. 
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FIGURE 5.8. Embedding efficiency and coding loss of syndrome-trellis codes for three distortion profiles. 
Each point was obtained by running the Viterbi algorithm with n = 10° cover pixels. Hamming [55] and 
BCH [93] codes were applied on a block-by-block basis on cover objects with n = 10° pixels with a brute- 
force search making up to three and four changes, respectively. The line connecting a pair of Hamming 
or BCH codes represents the codes obtained by their block direct sum. For clarity, the PI presents the 
coding loss results in the range a € {0.5, 1] only for constraint height h = 10 of the syndrome-trellis codes. 
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FIGURE 5.9. Results for the syndrome-trellis codes designed for relative payload a = 1/2. Left: Average 
number of cover pixels (x10°) quantized per second (throughput) shown for different constraint heights 
and two different implementations. Right: Average embedding efficiency for different code lengths n (the 
number of cover pixels), constraint heights h, and a constant distortion profile. Codes of length n > 1000 
have similar performance as for n = 1000. Each point was obtained as an average over 1000 samples. 


Figure 5.5 shows the average number of wet pixels out of n = 105 required to be changed in order to solve (5.12) for 
STCs with h = 11. The exact. value of 0; is irrelevant in this experiment as long as it is finite. This experiment suggests 
that STCs can be used with arbitrary 7 as long as a < 0.7. As can be seen from Figure 5.6, increasing the amount of wet 
pixels does not lead to any noticeable difference in embedding efficiency for constant profile. Similar behavior has been 
observed for other profiles and holds as long as the number of changed wet pixels is small. 


5.5.4. Experimental results. The PI has implemented the Viterbi algorithm in C++ and optimized its performance by 
using Streaming SIMD Extensions instructions. Based on the distortion profile, the algorithm chooses between the float 
and 1 byte unsigned integer data type to represent the weight of the paths in the trellis. The following results were 
obtained using an Intel Core2 X6800 2.93GHz CPU machine utilizing a single CPU core. 

Using the search described in Section 5.5.2, the PI found good syndrome-trellis codes of constraint height ἢ € {6,...,12} 
for relative payloads a = 1/w,w Ε {1,...,20}. Some of these codes can be found in {26, Table 1]. In practice, almost 
every code satisfying the design rules is equally good. This fact can also be seen from Figure 5.4, where 300 random codes 
are evaluated over different profiles. 

The effect of the profile shape on the coding loss for o(x) = σ as a function of d is shown in Figure 5.7. The coding 
loss increases with decreasing relative payload a. This effect can be compensated by using a larger constraint height h. 

Figure 5.8 shows the comparison of syndrome-trellis codes for three profiles with other ‘codes which are known for a 
given profile. The ZZW family [96] applies only to the constant profile. For a given relative payload a and constraint 
height A, the same submatrix H was used for all profiles. This demonstrates the versatility of the proposed construction, 
since the information about the profile does not need to be shared, or, perhaps more importantly, the profile does not 
need to be known a priori for a good performance. 

Figure 5.9 shows the average throughput (the number of cover pixels n quantized per second) based on the used 
data type. In practice, 1-5 seconds were enough to process a cover object with n = 10° pixels. The same figure shows 
the embedding efficiency obtained from very short codes for the constant profile. This result indicates that the average 
performance of syndrome-trellis codes quickly approaches its maximum w.r.t. n. This is again an advantage, since some 
applications may require short blocks. 


5.6. Practical Embedding Constructions. In this section, the PI shows some applications of the proposed methodol- 
ogy for spatial and transform domain (JPEG) steganography. In the past, most embedding schemes were constrained by 
practical ways of how to encode the message so that the receiver can read it. Problems such as “shrinkage” in F5 [90, 37] 
or in MMx [55] arose from this practical constraint. By being able to solve the PLS and DLS problems close to the bound 
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FIGURE 5.10. Comparison of methods with four different weight-assignment strategies $1-S4 and nsF5 
as described in Section 5.6.1 when simulated as if the best coding scheme was available. The performance 
of strategy S4 when practically implemented using STCs with h = 8 and h = 11 is also shown. 


for an arbitrary additive distortion function,!” steganographers now have much more freedom in designing new embedding 
algorithms. They only need to select the distortion function and then apply the proposed framework. The only task left 
to the steganographer is the choice of the distortion function D. It should be selected so that it correlates with statistical 
detectability. Instead of delving into the difficult problem of how to select the best D, the PI now provides a few examples 
of additive distortion measures motivated by recent developments in steganography and shows their performance when 
blind steganalysis is used. 

In the examples below, the PI tested the embedding schemes using blind feature-based steganalysis on a large database 
of images. The image database was evenly divided into a training and a testing set of cover and stego images, respectively. 
A soft-margin support-vector machine was trained using the Gaussian kernel. The kernel width and the penalty parameter 
were determined using five-fold cross validation on the grid (C,y) € {(10*, 25. ἀκ Ε {~3,..2,4}; 9 € {-3,... , 3}}, where 
d is the binary logarithm of the number of features. The results are reported using a measure frequently used in steganalysis 
- the minimum average classification error 


(5.14) Pe = min (Pea + Pup(Pra))/2, 
where Pra and Pyp are the false-alarm and missed-detection probabilities. 


5.6.1. DCT domain steganography. To apply the proposed framework, one first needs to design an. additive distortion 
function that can be tested by simulating the embedding as if the best codes are available.. Finally, the the most promising 
approach is implemented using STCs. It is assumed that the cover is a grayscale bitmap image which is JPEG compressed 
to obtain the cover image. Let A be a set of indices corresponding to AC DCT coefficients after the block-DCT transform 
and let ας be the ith AC coefficient before it is quantized with the quantization step q; for i € A. Let 4 represent the 
set of all vectors containing quantized AC DCT coefficients divided by their corresponding quantization step. In ordinary 
JPEG compression, the values c; are quantized to x; = [c;/qi]. 

[Proposed distortion functions] 

A binary embedding operation is defined as 7; + {x;,%,} by %; = x; + sign(c;/q; — σι), where sign(x) is 1 if z > 0, -1 
if x < 0 and sign(0) € {—1,1} uniformly at random. In simple words, z; is a quantized AC DCT coefficient and %;, is the 
same coefficient when quantized in the opposite direction. Let e; = |¢;/q; — xj) be the quantization error introduced by 
JPEG compression. By replacing z; with %; the error becomes |¢;/q; — %;| = 1 — e;. If e; = 0.5, then the direction where 
ci/qi is rounded depends on the implementation of the JPEG compressor and only small perturbation of the original 
image may lead to different results. Let P(x) = σα mod 2. By construction, P satisfies the property of a parity function, 
Ῥίαι) # P(%;). The distortion function is assumed to be in the form D(x, y) = 7", 0: - [zi # yi], where n = |A|. 

The following four approaches utilizing values of e; and g; were considered. All methods assign 9; = co when ὁ; /G € 


(—0.5,0.5) and differ in the definition of the remaining values 9; as follows: 


17The additivity constraint can be relaxed and more general distortion measures can be used with the PLS and DLS problems in practice [23]. 


FINAL REPORT FOR CONTRACT FA9550-08-1-0084 ENTITLED “TOWARDS STATISTICALLY UNDETECTABLE STEGANOGRAPHY” 44 


Li-3,j Li-3,j 


Li+3,; 


Li,j 74.7.13 
[4+6-C-0 = d= αι; - σι de = Fiji — ι.}:2 


d3 = Xij42 — Vi,j43 


FIGURE 5.11. Set of 4-pixel cliques used for calculating the distortion for digital images represented in 
the spatial-domain. The final distortion p;,;(y;,;) is obtained as a sum of terms penalizing the change in 
pixel x;,; measured w.r.t. each clique containing 2;,;. | 


51: 9; = 1— 2e; if c;/q; ¢ (—0.5,0.5) (as in perturbed quantization [33)), 

S2: 0; = qi(1— 2e;) if c:/q; ¢ (—0.5,0.5) (the same as S1 but 9; is weighted by the quantization step), 

58: 0; = 1 if c;/q; € (—1, —0.5] U [0.5,1) and 9; = 1 — 2e; otherwise, and 

54: οι = qi if ¢/q € (—1, —0.5] ὦ [0.5,1) and o; = φι(] — 2e;) otherwise which is similar weight assignment as 
proposed in {71}. 

To see the importance of the side-information in the form of the uncompressed cover image, the PI also includes in the 
tests the nsF'5 [37] algorithm, which can be represented using the above formalism as x; = [c;/q;], %; = x; — sign(x;), and 
0; = 00 if z; = 0 and 9; = 1 otherwise. This way, one always has |Z;| < |z;|. The nsF'5 embedding minimizes the number 
of changes to non-zero AC DCT coefficients. 

(Steganalysis setup and experimental results] 

The proposed strategies were tested on a database of 6, 500 digital camera images prepared as described in [58, Sec. 
4.1] so that their smaller size was 512 pixels. The JPEG quality factor 75 was used for compression. The steganalyzer 
employed the 548-dimensional CC-PEV feature set [57]. Figure 5.10 shows the minimum average classification error Pr 
achieved by simulating each strategy on the bound using the PLS formulation. The strategies $1 and 82, which assign 
zero cost to coefficients c;/gq; = 0.5, were worse than the nsF5 algorithm that does not use any side-information. On the 
other hand, strategy $4, which also utilizes the knowledge about the quantization step, was the best. By implementing 
this strategy, it is necessary to deal with a wet paper channe] which can be well modeled by a linear profile with relative 
wetness τ ~ 0.6 depending on the image content. The PI implemented strategy S4 using STCs, where wet pixels were 
handled by setting 0; = C for a sufficiently large C. As seen from the results using STCs, payloads below 0.15 bits per 
non-zero AC DCT coefficient were undetectable using the employed steganalyzer. 

Note that the strategy utilized only the information obtainable from a single AC DCT coefficient. In reality, 0; will 
likely depend on the local image content, quantization errors, and quantization steps. The PI postpones the problem of 
optimizing D w.r.t. statistical detectability to future research. 


5.6.2. Spatial domain steganography. To demonstrate the merit of the STC-based multi-layered construction, the PI 
presents a practical embedding scheme that was largely motivated by [65] and [23]. Single per-pixel distortion function 
/i,j(¥i.7) Should assign the cost of changing the i, jth pixel 2;,;, first, from its neighborhood and then also based on the 
new value y;,;. Changes made in smooth regions often tend to be highly detectable by blind steganalysis which should 
lead to high distortion values. On the other hand, pixels which are in busy and hard-to-model regions can be changed 
more often. 

[Proposed distortion functions] 

The distortion function is designed based on a model built from a set of all straight 4-pixel lines in 4 different orientations 
containing the i; jth pixel. called cliques (see Figure 5.11). Based on the set of all such cliques, one decides on the value 
(i,j (¥i,j)» Due to strong inter-pixel dependencies, most cliques contain very similar values and thus differences between 
neighboring pixels tend to be very close to zero. It has been experimentaly observed [65], that the number of cliques falls 
of quickly as the differences get larger. From this point of view, any clique with small differences should lead to a larger 
distortion because there are more samples the warden can use for training her steganalyzer. 


FINAL REPORT FOR CONTRACT FA9550-08-1-0084 ENTITLED “TOWARDS STATISTICALLY UNDETECTABLE STEGANOGRAPHY” 45 


~ BOWS2 database — BOWS2 database ᾿ 
Payload-limited sender Distortion-limited sender 
— simulated emb. Ἴ —— _ simulated emb. 
“Ἢ --- STC/ML-STC | © \< τι STC/ML-STC | 
5 > 5 
δ δ 
5 S$ 
< < 
0 | | 0 ) 
0 0.2 0.4 0.6 0.8 1 0 0.2 0.4 0.6 0.8 1 
Relative payload a (bpp) Relative payload a (bpp) 
FIGURE 5.12. Comparison of LSB matching with optimal. binary and ternary coding with embedding 
algorithms based on the additive distortion measure (5.15) using embedding operations of three different 
cardinalities. 

More formaly, let x € {0,...,255}"'*™* be an πὶ x ng grayscale cover image, n = ning, represented in the spatial 
domain. Define the co-occurrence matrix computed from horizontal pixel differences D7’;(x) = 24,541 —2ij, ὁ =1,...,m1, 
9 =1,...,n2g-—1: 

(x) = τ" Te a ‘53 τ J+1? Dz}4-2)(X) ™ (p, q; τὴ] 
“one i=l j=l m(n2 — 3) . 


where |( (( Dz}, Drj+1 Di 49)(x) = (p, 4, r)] = (Dz * (x) = p)&(D; i541 (X ) ” q)&(Dz* 542%) = τὴ]. Clearly, Asx'4,r(X) € (0, 1] 
is the normalized count of neighboring quadruples of pixels {24,j, 2i,54+1,2i,54+2,2i,j4+3} With differences 2,441 — i,j; = Ὁ, 
71,42 — Tij+1 = 4, and 2,543 — 21342 = Τ'ὶ in the entire image. The superscript arrow “-+” denotes the fact that the 
differences are computed by subtracting the left pixel from the right one. The matrices AZ’, .(x), A}, -(x), and A>, ,.(x) 
are defined similarly. Let y;,;x~:,; be an image obtained from x by replacing the (i, 7)th pixel with value y,; ;. Finally, the 
PI defines the distortion measure D(y) = $7j2, 052; pes (¥i3) by 


ἐξεὶ 


(5.15) Pi,j (4,5) = .» Wp,q.r| Ab g,r(X) a ἊΝ τὰν (YigX~i,)I, 
p,q,ré{—255,...,255} 
δεί- ΤΙ 


where wp.r = 1/(1+ \/p* +q?+1?) are heuristically chosen weights. 
[Steganalysis setup and experimental results] 


All tests were carried out on the BOWS2 database [2] containing approximately 10, 800 grayscale images with a fixed 
size of 512 x 512 pixels coming from rescaled and cropped natural images of various sizes. Steganalysis was implemented 
using the second-order SPAM feature set with T = 3 [64]. 

Figure 5.12 contains the comparison of embedding algorithms implementing the PLS and DLS with the costs (5.15). 
All algorithms are contrasted with LSB matching simulated on the binary and ternary bounds. To compare the effect of 
practical codes, the PI first simulated the embedding algorithm as if the best codes were available and then compared 
these results with algorithms implemented using STCs with h = 10. Both types of senders are implemented with binary, 
ternary (7, = {z; —1,...,2; +1}), and pentary (Z, = {σι —2,...,2; + 2}) embedding operations. Before embedding, the 
binary embedding operation was initialized to 7; = {z;, yi} with y; randomly chosen from {z; — 1,2; +1}. The reported 
payload for the DLS with a fixed D, was calculated as an average over the whole database after embedding. 

The relative horizontal distance between the corresponding dashed and solid lines in Figure 5.12 is bounded by the 
coding loss. Most of the proposed algorithms are undetectable for relative payloads a < 0.2 bits per pixel (bpp). For 
payloads a < 0.5, the DLS is more secure. For larger payloads, the distortion measure seems to fail to capture the statistical 
detectability correctly and thus the algorithms are more detectable than when implemented in the payload-limited regime. 
Finally, the results suggest that larger embedding changes are useful for steganography when placed adaptively. 
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5.7. Discussion. The concept of embedding in steganography that minimizes a distortion function is connected to many 
basic principles used for constructing embedding schemes for complex cover sources today, including the principle of 
minimal-embedding-impact [37], approximate model-preservation [65], or the Gibbs construction [23]. The current work 
describes a.complete practical framework for constructing steganographic schemes that embed by minimizing an additive 
distortion function. Once the steganographer specifies the form of the distortion function, the proposed framework provides 
all essential tools for constructing practical embedding schemes working close to their theoretical bounds. The methods 
are not limited to binary embedding operations and allow the embedder to choose the amplitude of embedding changes 
dynamically based on the cover-image content. The distortion function or the embedding operation do not need to be 
shared with the recipient. In fact, they can even change from image to image. The framework can be thought of as an 
off-the-shelf method that allows practitioners to concentrate on the problem of designing the distortion measure instead 
of the problem of how to construct practical embedding schemes. 

The merit of the proposed algorithms is demonstrated experimentally by implementing them for the JPEG and spatial 
domains and showing an improvement in statistical detectability as measured by state-of-the-art blind steganalyzers. The 
PI has demonstrated that larger embedding changes provide a significant gain in security when placed adaptively. Finally, 
the construction is not limited to embedding with larger amplitudes but can be used, e.g., for embedding in color images, 
where the LSBs of all three colors can be seen as 3-bit symbols on which the cost functions are defined. Applications 
outside the scope of digital images are possible as long as the costs can be defined. 

The embedding using an additive distortion function has been greatly extended to essentially arbitrary distortion 
functions as described in [23] and in the next section. 
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6. GIBBS CONSTRUCTION IN STEGANOGRAPHY 


This section describes one of the main achievements of this effort — a general construction for building steganographic 
schemes using the principle of minimum embedding distortion. The contribution is based on a connection between 
steganography design by minimizing embedding distortion and statistical physics. The unique aspect of this work and one 
that distinguishes it from prior art is that the distortion function is allowed to be arbitrary, which permits considering 
spatially-dependent embedding changes. The PI provides a complete theoretical framework and describes practical tools, 
such as the thermodynamic integration for computing the rate-distortion bound and the Gibbs sampler, for simulating 
the impact of optimal embedding schemes and constructing practical algorithms. The proposed framework reduces the 
design of secure steganography in empirical covers to the problem of finding local potentials for the distortion function 
that correlate with statistical detectability in practice. By working out the proposed methodology in detail for a specific 

| choice of the distortion function, the PI experimentally validates the approach and discusses various options available to 
the steganographer in practice. 

There exist two general and widely used principles for designing steganographic methods for empirical cover objects, 
such as digital images. The first one is model-preserving steganography in which the designer adopts a model of the cover 
source and then designs the embedding to either completely or approximately preserve the model [45, 69, 72, 74, 81). 
This way, one can provide mathematical guarantee that the embedding is perfectly secure (or ¢-secure) within the chosen 
model. A problem is that empirical cover objects are notoriously difficult to model accurately, and, as history testifies, 
the model mismatch can be exploited by an attacker to construct a sensitive detection scheme. Even worse, preserving 
an oversimplified model could introduce a security weakness (8, 56, 91]. An obvious remedy is to use more complicated 
models that would better approximate the cover source. The major obstacle here is that most current model-preserving 
steganographic constructions are specific to a certain model and do not adapt easily to more complex models. 

The second, quite pragmatic, approach avoids modeling the cover source altogether and, instead, minimizes a heuristically- 
defined embedding distortion (impact). Matrix embedding [1δ], wet paper codes [36], and minimal embedding distortion 
steganography [26, 31, 33, 55, 71] are examples of this philosophy. Despite its heuristic nature, the principle of minimum 
embedding distortion has produced the most secure steganographic methods for digital media known today, at least in 
terms of low statistical detectability as measured using blind steganalyzers [37, 55, 58, 71]. Most of these schemes, however, 
use a distortion function that is additive — the total distortion is a sum of individual pixel distortions computed from the 
cover image. Fundamentally, such a distortion function cannot capture interactions among embedding changes, which 
leads to suboptimality in practice. This deficiency affects especially adaptive schemes for which the embedding changes 
have a tendency to form clusters because the pixel distortion is derived from local content or some content-dependent 
side-information. For example, the embedding changes might follow edges or be concentrated in textured regions. 

One discovers a relationship between both embedding principles when the distortion function is defined as a weighted 
norm of the difference between feature vectors of cover and stego objects in some properly chosen feature space [56, 66], an 
example of which are spaces utilized by blind steganalyzers. The projection onto the feature space is essentially equivalent 
to modeling the objects in a lower-dimensional Euclidean space. Consequently, minimizing the distortion between cover 
and stego objects in the feature space now becomes closely tied to model preservation. Yet again, in this case the distortion 
cannot be written as a sum of individual pixel distortions also because the features contain higher-order statistics, such 
as sample transition probability matrices of pixels or DCT coefficients modeled as Markov chains [11, 64, 67, 77). 

The importance of modeling interactions among embedding changes in steganography has been indirectly recognized 

by the designers of MPSteg [10] (Matching Pursuit Steganography) and YASS (73, 80]. In MPSteg, the authors use 
an overcomplete basis and embed messages by replacing small blocks with other blocks with the hope of preserving 
dependencies among neighboring pixels. The YASS algorithm taught us that a high embedding distortion may not 
directly manifest as a high statistical detectability, a curious property that can most likely be attributed to the fact that 
the embedding modifications are content driven and mutually correlated. Both approaches are heuristic in nature and 
leave many important issues unanswered, including establishing performance bounds, evaluating the methods’ performance 
w.r.t. to these bounds, and creating a methodology for achieving near-optimal performance. 
The above discussion underlines the need for a more systematic approach to steganography that can consider mutual 
interaction of embedding modifications, which is the topic of this section. The main contribution is a general framework for 
embedding using arbitrary distortion functions and a complete practical methodology for minimizing embedding distortion 
in steganography. The approach is flexible as well as modular and allows the steganographer to work with non-additive 
distortion functions. The PI provides algorithms for computing the proper theoretical bounds expressing the maximal 
payload embeddable with a bounded distortion, for simulating the impact of a stegosystem operating on the bound, and 
for designing practical steganographic algorithms that operate near the bound. The algorithms leverage standard tools 
used in statistical physics, such as Markov chain Monte Carlo samplers or the thermodynamic integration. 
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In the next section, the PI recalls the basic result that embedding changes made by a steganographic method that 
minimizes embedding distortion must follow a particular form of Gibbs distribution. The main purpose of this section 
is to establish terminology and make connections between the concepts used in steganography and those in statistical 
physics. In Section 6.2, the PI introduces the so-called separation principle, which includes several distinct tasks that 
must be addressed when developing a practical steganographic method. In particular, to design and evaluate practical 
schemes one needs to establish the relationship between the maximal payload embeddable using bounded distortion (the 
rate-distortion bound) and be able to simulate the impact of a scheme operating on the bound. In the special case 
when the embedding distortion can be expressed as a sum of distortions at individual pixels computed from the cover 
image (the so-called non-interacting embedding changes), the design of near-optimal embedding algorithms has been 
successfully resolved in the past. For completeness, and because the proposed framework builds upon these results, the PI 
briefly summarizes such known achievements in Section 6.3. Continuing with the case of a general distortion function; in 
Section 6.4 the PI describes two useful tools for steganographers — the Gibbs sampler and the thermodynamic integration. 
The Gibbs sampler can be used to simulate the impact of optimal embedding and to construct practical steganographic 
schemes (in Sections 6.5 and 6.6). The thermodynamic integration is a method for estimating the entropy and partition 
function in statistical physics and it is used for computing the rate-distortion bound in steganography. The design of 
practical embedding schemes begins in Section 6.5 with the study of distortion functions that can be written as a sum of 
local potentials defined on cliques. In Section 6.6, the PI first discusses various options the new framework offers to the 
steganography designer and then makes a connection between local potentials and image models used in blind steganalysis. 
The proposed framework is experimentally validated in Section 6.7 with a discussion of various implementation issues. 
Finally, the section is concluded in Section 6.8. 


6.1. Gibbs distribution minimizes embedding distortion. The PI first recalls a well-known and quite general fact 
that, for a given expected embedding distortion, the maximal payload is embedded when the embedding changes follow 
a Gibbs distribution. This establishes a connection between steganography and statistical physics, which, later in this 
section, will allow computing rate-distortion bounds, simulate the impact of optimal embedding, and construct practical 
embedding algorithms. 

Although the entire framework is certainly applicable to steganography in other objects than digital images, it is 
described using the terms “image” and “pixel” for concreteness to simplify the language and to allow a smooth transition 
from theory to experiméntal validation, which is carried out for digital images. An image x = (z),...,2n) € X &I 
is a regular lattice of elements (pixels) x; € Z,i € S, S = {1,...,n}. The dynamic range, Z, depends on the character 
of the image data. For example, for an 8-bit grayscale image, J = {0,1,...,255}. In general, z; can stand not only 
for light intensity values in a raster image but also for transform coefficients, palette indices, audio samples, etc. The 
proposed framework remains valid even when z, is organized into an arbitrary graph structure. For notational simplicity 
and convenience, additional conventions are adopted. Given J C S, x7 3 {z;|i € J} and x. 7 * {z;|i 8 -- J}. The 
image (2),...,2i-1, Yi, Zi41,---,2n) Will be abbreviated as y;xW~;. 

Every steganographic embedding scheme considered here will be associated with a mapping that assigns to each cover 
ΧΕ X the pair {V, 7}. Here, Y ( Δ᾽ is the set of all stego images y into which x is allowed to be modified by embedding 
and 7 is a probability mass function on )Y that characterizes the actions of the sender. The embedding algorithm is 
such that, for a given cover x, the stego image y € ) is sent with probability z(y). The stego image is thus a random 
variable Y over Y with the distribution P(Y = y) = 7(y). Technically, the set Y and all concepts derived from it in 
this report depend on x. However, because x is simply a parameter that one fires in the very beginning, the notation is 
simplified by not making the dependence on x explicit. Finally, note that the maximal expected payload that the sender 
can communicate to the receiver in this manner is the entropy 


(6.1) H(n) ὃ H(Y) = -- Σ᾽ aly) loga(y). 
yey 


To put it another way, the PI defines a steganographic method from the point of view of how it modifies the cover and 
only then deals with the issues of how to use it for communication and how to optimize its performance. The optimization 
will involve finding the distribution 7 for given x, Y, and payload (distortion). 

The following special form of the set )): = Z; x Ig x --- x Z, will be considered, where Z; C Z. For example, in 
Least Significant Bit (LSB) embedding, Z; = {x;,z;}, where the bar denotes the operation of flipping the LSB. In LSB 
matching [49] (also called +1 embedding) in an 8-bit grayscale image x, Z; = {x; — 1,2;, 2; + 1} whenever αὶ ¢ {0,255} 
and Z; is appropriately modified for the boundary cases. When |Z,| = 2 or 3 for all 1, one speaks of binary and ternary 
embedding, respectively. In general, however, the size of every set Z; can be different. For example, pixels not allowed to 
be modified during embedding (the so-called wet pixels [36]} have 7; = {z;}. 
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By sending a slightly modified version y of the cover x, the sender introduces a distortion, which will be measured 
using a distortion function 


(6.2) D: YR, 


that is bounded, i.e., |D(y)| « K, for all y € Y for some sufficiently large K. Note that D also depends on x. Allowing 

the distortion to be negative does not cause any problems because an embedding algorithm minimizes D if and only if it 

minimizes the non-negative distortion D+ Κα. The need for negative distortion will become apparent later in Section 6.5.1. 
The expected embedding distortion introduced by the sender is 


(6.3) E,[D] = Σ᾿ r(y)D(y). 
yey 

An important premise made now is that the sender is able to define the distortion function so that it is related to 
statistical detectability.'* This assumption is motivated by a rather large body of experimental evidence, such as (37, 58], 
that indicates that even simple distortion measures that merely count the number of embedding changes correlate well 
with statistical detectability in the form of decision error of steganalyzers trained on cover and stego images. In general, 
steganographic methods that introduce smaller distortion disturb the cover source less than methods that embed with 
larger distortion. 

Distortion-limited sender. To maximize the security, the so-called distortion-limited sender attempts to find a 
distribution z on ) that has the highest entropy and whose expected embedding distortion does not exceed a given D,: 


(6.4) maximize H(m) = -- Sry) log x(y) 
yey 
(6.5) subject to E,(D] = δ΄ m(y)D(y) = De. 
yey 


By fixing the distortion, the sender fixes the security and aims to communicate as large payload as possible at this level of 
security. The maximization in (6.4) is carried over all distributions 7 on 3). The PI comments on whether the distortion 
constraint should be in the form of equality or inequality shortly. 

Payload-limited sender. Alternatively, in practice it may be more meaningful to consider the payload-limited sender 
who faces a complementary task of embedding a given payload of m bits with minimal possible distortion. The optimization 
problem is to determine a distribution 7 that communicates a required payload while minimizing the distortion: 


(6.6) minimize E,({D] = Σ»ἤϑ πί(υ) Β(υ) 
γεν 
(6.7) subject to Η(πὶ ΞΞ πι. 
The optimal distribution 7 for both problems has the Gibbs form 
(6.8) ™(y) = Zany *Pi-ADW)), 
where Z(A) is the normalizing factor 
(6.9) Z(A) = Σ exp(—AD(y)). 
yey 


The optimality of 7, follows immediately from the fact that for any distribution μ᾿ with E,,(D] = >> <yu(y)D(y) = δε. 
the difference between their entropies, Η (πλ) — H(u) = Dxx(s\|7,) > 0 [92]. The scalar parameter 4 > 0 needs to be 
determined from the distortion constraint (6.5) or from the payload constraint (6.7), depending on the type of the sender. 
Provided m or D, are in the feasibility region of their corresponding constraints, the value of 4 is unique. This follows 
from the fact that both the expected distortion and the entropy are monotone decreasing in λ. To see this, realize that 
by direct evaluation 

9 
Or 
where Var,, [Ὁ] = E,,|D?]—(E,,[D])”. Substituting (6.8) into (6.1), the entropy of the Gibbs distribution can be written 
as 


(6.10) ἘΦ. [Ὁ] = —Var;,(D] < 0, 


18The ability of a warden to distinguish between cover and stego images using statistical hypothesis testing. 
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(6.11) H (ma) = log Z() + τ AEs, (DI. 
Upon differentiating and using (6.10), one obtains 
ὃ εἰς Ζ' (λ) 
(6.12) ay Ht (ta) = as (Fey Z(a) + E,,[D] —AVare, D1) 
(6.13) -- “᾽ Var,,,(D] « 0. 


The monotonicity also means that the equality distortion constraint in the optimization problem (6.5) can be replaced 
with inequality, which is perhaps more appropriate given the motivating discussion above. 

By varying A € [0,00), one obtains a relationship between the maximal expected payload (6.1) and the expected 
embedding distortion (6.3). For brevity, this relationship will be called the rate-distortion bound. What distinguishes this 
concept from a similar notion defined in information theory is that the bound is considered for a given cover x rather than 
for X, which is a random variable. At this point, it is appropriate to note that while it is certainly possible to consider x 
to be generated by a cover source with a known distribution and approach the design of steganography from a different 
point of view, namely one in which 7) is determined by minimizing the KL divergence between the distributions of cover 
and stego images while satisfying a payload constraint, it is not done here. 

βωω note that the assumption |D(y)| « K implies that all stego objects appear with nonzero probability, 7,(y) > 
Zi 109] exp(—AK‘), a fact that is crucial for the theory developed here. 


Remark 12. In statistical physics, the term distortion is known as energy. The optimality of Gibbs distribution is 
formulated as the Gibbs variational principle: “Among all distributions with a given energy, the Gibbs distribution (6.8) 
has the highest entropy.” The parameter λ is called the inverse temperature, 1 = 1/kT, where T is the temperature and 
k the Boltzmann constant. The normalizing factor Z(A) is called the partition function. 


6.2. The separation principle. The design of steganographic methods that attempt to minimize embedding distortion 
should be driven by their performance. The obvious choice here is to contrast the performance with the rate—distortion 
bound. This is a meaningful comparison for the distortion-limited sender who can assess the performance of a practical 
embedding scheme by its loss of payload w.r.t. the maximum payload embeddable using a fixed distortion. This so-called 
“coding loss” informs the sender of how much payload is lost for a fixed statistical detectability. On the other hand, it is 
much harder for the payload-limited sender to assess how the increased distortion of a suboptimal practical scheme impacts 
statistical detectability in practice. One could resolve this rather important practical issue if one was able to simulate the 
impact of a scheme that operates on the bound.'? Because the problems of establishing the bounds, simulating optimal 
embedding, and creating a practical embedding algorithm are really three separate problems, this reasoning will be called 
the separation principle. It involves addressing the following three tasks: 


(1) Establishing the rate-distortion bounds. This means solving the optimization problems (6.4) or (6.6) and 
expressing the largest payload embeddable using a bounded distortion (or minimal distortion needed to embed a 
given payload). These bounds inform the steganographer about the best performance that can be theoretically 
achieved. Depending on the form of the distortion function D, establishing the bounds is usually rather challenging 
and one may have to resort to numerical methods (Section 6.4.2). For an additive distortion (to be precisely defined 
shortly), an analytic form of the bounds may be obtained (Section 6.3). 

(2) Simulating an optimal embedding method. Often, it is very hard to construct a practical embedding 
method that performs close to the bound. However, one may be.able to simulate the impact of such an optimal 
method and thus subject it to tests using steganalyzers even when it is not known how to construct a practical 
embedding algorithm or even compute the bound (see Section 6.4). This is important for developers as one can 
effectively “prune” the design process and focus on implementing the most promising candidates. The simulator 
will also inform the payload-limited sender about the potential improvement in statistical undetectability should 
the theoretical performance gap be closed. A simple example is provided by the case of the Hamming distortion 
function D(y) = Σ΄. ἶνι # σι]. Here, the maximal relative payload a = m/n (in bits per pixel or bpp) is bounded 
by a < A(8), where 8 = ΖΡ, is the relative embedding distortion known as the change rate. In this case, one 
can simulate the embedding impact of the optimal scheme by independently changing each pixel with probability 


h-*(a). 


194 scheme whose embedding distortion and payload lay on the rate-distortion bound derived for a given cover. 
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(3) Constructing a practical near-optimal embedding method. This point is of most interest to practitioners. 
The bounds and the simulator are necessary to evaluate the performance of any practical scheme. The designer 
tries to maximize the embedding throughput (the number of bits embedded per unit time) while embedding as 
close to the distortion bound as possible. 


It should be stressed at this point that even though the optimal distribution of embedding modifications has a known 
analytic expression (6.8), it may be infeasible to compute the individual probabilities 7)(y) due to the complexity of 
evaluating the partition function Z(A), which is a sum over all y, whose count can be a very large number even for small 
images. (For example, there are 2" binary flipping patterns in LSB embedding.) This also implies that at present it is 
not clear how to compute the expected distortion (6.3) or the entropy (6.1) (these tasks are postponed to Section 6.4). 
Fortunately, in many cases of practical interest it is not necessary to evaluate m)(y) and it will be enough to merely 
sample from 7). The ability to sample from 7) is sufficient to simulate optimal embedding and realize practical embedding 
algorithms, and, in this case, even compute the rate-distortion bound. 

In some special cases, however, such as when the embedding changes do not interact, the distortion D is additive 
and one can easily compute A and the probabilities, evaluate the expected distortion and payload, and even construct 
near-optimal embedding schemes. As this special case will be used later in Section 6.6 to design steganography with more 
general distortion functions D, it is briefly reviewed in the next section. 


6.3. Non-interacting embedding changes. When the distortion function D is additive over the pixels, 


(6.14) D(y) = Σ᾿ pi(ys); 


t=] 
with bounded p; : 7, -- R, the embedding changes are said to not interact. In this case, the probability ,(y) can be 
factorized into a product of marginal probabilities of changing the individual pixels (this follows directly from (6.8)): 


᾿ oy TY (eae) 
(6.15) a(y) = I (yi) Hs ez, xP(—Api(ti)) 


The expected distortion and the maximal payload are: 


(6.16) E,,[D] = ts δ΄ a(ti)pi(ts), 
i=l ¢; 7, 
(6.17) H(m)=—- δ δ᾽ malts) log ma(ts). 
t=1 ἐ: ΕΖ, 


The impact of optima] embedding can be simulated by changing 2, to y; with probabilities πλίν,) independently of the 
changes at other pixels. Since these probabilities can now be easily evaluated for a fixed A, finding ἃ that satisfies the 
distortion (5. [0] = D.) or the payload (H(7)) =m) constraint amounts to solving an algebraic equation for λ (see [31] 
or [29]). Because both the expected distortion and the entropy are monotone w.r.t. A, the solution is unique. The only 
practical near-optimal embedding algorithm for this case known to the PI is based on syndrome-trellis codes [25]. 

It will be instructional to work out as an example the details δ the special case of binary embedding for which 
i; = {x gj with x0) = σι. Thus, p; attains only two values, pi) = pias”, t = 0,1. The PI stresses at this point 


that it is not assumed that pi” = 0 or even that po? > po). This fact will be important when implementing practical 


embedding schemes in Section 6.5.1. The above expressions simplify to 


(6.18) (zx!) = wnt “os 
(6.19) - a 2 p.(2), 
(6.20) E;,{D] = 300 (1 — ps(A)) + oy ?pi(A), 
(6.21) H(m) = " h(p;(A)). 


gz] 
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The smallest distortion we et embedding algorithm can impose is Dmin = y Dew min{ ph”), pi? }, which would be 
incurred when selecting y; = ay ) where t; = arg min;{ ps }͵ Thus, 


(6.22) Diy) = So fue = 2] + 0 [oy = 2] 
i=] 
(6.23) = Dmin+ δ᾽ 0: [yi # 2$], 
t=] 
where 9; = lot? — po! is now a vector of non-negative distortions, which allows applying the practical embedding 


algorithm described in [26]. It accepts on its input a bit stream c = (c;(x),...,Cn(x)) (representing the cover x), the 
vector of non-negative distortions (01,...,0,), and a binary message. It outputs a modified (stego) bit stream y € {0,1}" 
that conveys the message as a syndrome of a suitably chosen syndrome-trellis code so that the total embedding distortion 
>= Oily: # Ci] is near minimal. It follows from (6.23) that binary embedding as defined in this section can be implemented 
in practice by applying this algorithm to the bit stream c;(X), X = (i) Ὁ alin)). 

The complete derivation of the rate-distortion bound for binary embedding appears, e.g., in Chapter 7 of [29]. 


6.4. Simulated embedding and rate—distortion bound. In Section 6.1, it has been showed that minimal-embedding- 
distortion steganography should select the stego image y with probability 7,(y) « exp(—AD(y)) expressed in the form 
of a Gibbs distribution. The PI now explains a general iterative procedure using which one can sample from any Gibbs 
distribution and thus simulate optimal embedding. The method is recognized as one of the Markov Chain Monte Carlo 
(MCMC) algorithms known as the Gibbs sampler.”” This sampling algorithm will allow constructing practical embedding 
schemes in Sections 6.5 and 6.6. It will also be explained how to compute the rate-distortion bound for a fixed image 
using the thermodynamic integration. The Gibbs sampler and the thermodynamic integration appear, for example, in [92] 
and |62], respectively. 


6.4.1. The Gibbs sampler. The PI starts with defining the local characteristics of a Gibbs field as the conditional proba- 
bilities of the ith pixel attaining the value y; conditioned on the rest of the image: 


(yy ~i) 
6.24 m(Y% =y!V¥ui =y~:) =—— τ΄ 


For all possible stego images y, y’ € Y, the local characteristics (6.24) define the following matrices P(2), for each pixel 


16 {1,...,n}: 


m(Yi = [ων =yai) when γ΄ = yn; 
6.25 τὰς 
( ) Pyy'(i) = ‘3 otherwise. 


Every matrix P(i) has || rows and the same number of columns (which means it is very large) and its elements are mostly 
zero except when y’ was obtained from y by modifying y; to y) and all other pixels stayed the same. Because P(t) is 
stochastic (the sum of its rows is one), 


(6.26) Ly Py y’(i) =1, for all rows y, 
y’EY 


P(i) is a transition probability matrix of some Markov chain on 3). All such matrices satisfy the so-called detailed balance 
equation 


(6.27) Ty) γ ν΄ (1) = ™(y’) Py y (i), for all y,y € Y, ὁ 


20More detailed discussion regarding our choice of the MCMC sampler appear later in this section. 
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Algorithm 1 One sweep of a Gibbs sampler. 
1: Set pixel counter i = 1 
2: while 1 < n do 
3: Compute the local characteristics: 


(6.34) ων σον (σ()), Yo(s) Ε Lo (i) 


4: Select one ψ..η € Z,(:) pseudorandomly according to the probabilities (6.34) and change y,(i) + ¥,(;) 
5 1e 141 
6: end while 
7: return y 


To see this, realize that unless y.~; = y_,, one is looking at the trivial equality 0 = 0. For y~; = γ΄,» one obtains the 
following chain of equalities: 


(6.28) ™aly)Pyy (i) = aly) — oy 


ὦ) _ maly)maly’) 


(6.29) Deez, MA(tiy~:) 

= m(y’) ey) 
(6.30) = aly Ἐς: ™r(tiy;) 
(6.31) ω πλίγ Py: ν(). 


Equality (a) follows from the definition οὗ P(z) (6.25), (6) from the fact that y.; = γ΄,» and (c) from m)(y) = ma(ywy_,) 
and again (6.25). 

Next, the PI defines the boldface symbol 7 € [0, 00)!! as the vector of || non-negative elements 7) = m,(y), y € Ὁ). 
Using (6.27) and then (6.26), one can now easily show that the vector 7) is the left eigenvector of P(t) corresponding to 
the unit eigenvalue: 


(6.32) (7 P(i))y = > yi ™(y) Py, y(t) 
yey 

(6.33) = δ᾽ aly’) Pyy(t) = may’). 
yey 


In (6.32), (7,P(t))y is the y’th element of the product of the vector 7) and the matrix P(z). 

It is now time to describe the Gibbs sampler [40], which is a key element in the framework. Let o be a permutation of 
the index set S called the visiting schedule (a(t), i =1,...,n is the tth element of the permutation 7). One sample from 
πλ is then obtained by repeating a series of “sweeps” defined below. As the sweeps and the Gibbs sampler are explained, 
the reader is advised to inspect Algorithm 1 to better understand the process. 

The sampler is initialized by setting y to some initial value. For faster convergence, a good choice is to select y; from 
Z, according to the local characteristics 7)(yi:x i). A sweep is a procedure applied to an image during which all pixels 
are updated sequentially in the order defined by the visiting schedule o. The pixels are updated based on their local 
characteristics (6.24) computed from the current values of the stego image y. The entire sweep can be described by a 
transition probability matrix P(o) obtained by matrix-multiplications of the individual transition probability matrices 


P(o(t)): 
(6.35) Py,y'(o) = (P(o(1)) - P(o(2)) --- P(o(n))),  - 

After each sweep, the next sweep continues with the current image y as its starting position. It should be clear from 
the algorithm that at the end of each sweep each pixel i has a non-zero probability to get into any of its states from 
I, defined by the embedding operation (because D is bounded). This means that all elements of ¥ will be visited with 
positive probability and thus the transition probability matrix P(o) corresponds to a homogeneous irreducible Markov 
process with a unique left eigenvector corresponding to a unit eigenvalue (unique stationary distribution). Because 7) is 
a left eigenvector corresponding to a unit eigenvalue for each matrix P(i), it is also a left eigenvector for P(c) and thus its 
stationary distribution due to its uniqueness. A standard result from the theory of Markov chains (see, e.g. Chapter 4 
in [92]) states that, for an irreducible Markov chain, no matter what distribution of embedding changes v € [0, 00)'~! one 
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starts with, and independently of the visiting schedule o, with increased number of sweeps, k, the distribution of Gibbs 
samples converges in norm to the stationary distribution 7): 


(6.36) Πν (P(o))* — πλ!! 4 0 with k > 00 


exponentially fast. This means that in practice one can obtain a sample from 7) after running the Gibbs sampler for 
a sufficiently long time.*/ The visiting schedule can be randomized in each sweep as long as each pixel has a non-zero 
probability of being visited, which is a necessary condition for convergence. 


6.4.2. Simulating optimal embedding. When applied to steganography, the Gibbs sampler allows the sender to simulate the 
effect of embedding using a scheme that operates on the bound. It is interesting that this can be done for any distortion 
function D and without knowing the rate-distortion bound. This is because the local characteristics (6.24) 


exp(—AD(yiy~i)) 
(6.37) WY, = yl Yai = 4) = τας-θι-- -οὐλνως 
eo δύ εσ, xP(—AD(tiy~:)) 
do not require computing the partition function Ζ(λ). One does need to know the parameter λ, though. 
For the distortion-limited sender (6.5), the Gibbs sampler could be used directly to determine the proper value of λ in 
the following manner. For a given A, it is known (Theorem 5.1.4 in [92]) that 


k 
(6.38) ; S| D(y®) + Ex,[D] as k -- 00 
j=! 


in L and in probability, where yY) is the image obtained after the jth sweep of the Gibbs sampler. This requires running 
the Gibbs sampler and averaging the individual distortions for a sufficiently long time. When only a finite number of 
sweeps is allowed, the first few images y should be discarded to allow the Gibbs sampler to converge close enough to 7. 
The value of λ that. satisfies E,,|D] = [δὲ can be determined, for example, using a binary search over λ. 

To find \ for the payload-limited sender (6.4), one needs to evaluate the entropy H(m,), which can be obtained from 
ΕΞ [Ὁ] using the method of thermodynamic integration [62]. From (6.10) and (6.13), one obtains 


ae 
In2 0X 
Therefore, the entropy can be estimated from E,,,(D) by integrating by parts: 


(6.39) 5 Hn») == E,,(D). 


͵ λ 
(6.40) (ny) = H (m6) + | 5 Bay (D)] 


λ 
1 Γ ; 
; πατὴ / Ες, [Ρ]άλ’. 

Ao 
The value of that satisfies the entropy (payload) constraint can be again obtained using a binary search. Having obtained 
the expected distortion and the entropy using the Gibbs sampler and the thermodynamic integration, the rate-distortion 
bound [H (7), £,,[D]]} can be plotted as a curve parametrized by Δ. 

In practice, one has to be careful when using (6.38), since no practical guidelines exist for determining a sufficient 
number of sweeps and heuristic criteria are often used [14, 92]. Although the convergence to 7) is exponential in the 
number of sweeps, in general a large number of sweeps may be needed to converge close enough. Generally speaking, the 
stronger the dependencies between embedding changes the more sweeps are needed by the Gibbs sampler. In theory, the 
convergence of MCMC methods, such as the Gibbs sampler, may also slow down in the vicinity of “phase transitions,” 
which is loosely defined here as sudden changes in the spatial distribution of embedding changes when only slightly 
changing the payload (or distortion bound). 

In experiments reported later in this section, the Gibbs sampler always behaved well and converged fast. This is 
attributed to the fact that the dependencies among embedding modifications as measured using our distortion functions 
are rather weak and limited to short distances. The convergence, however, could become an issue for other types of cover 
sources with different distortion functions. While it is possible to compute the rate-distortion bounds and simulate optimal 
embedding using other MCMC algorithms, such as the Metropolis-Hastings sampler [92], that may converge faster than 
the Gibbs sampler and can exhibit a more robust behavior in practice, it is not clear how to adopt these algorithms for 
practical embedding. This is because all known coding methods in steganography essentially sample from a distribution of 
independent symbols. Thus, the Gibbs sampler comes out as a natural choice (Section 6.5) because it works by updating 
individual pixels, which is exactly the effect of embedding using syndrome-trellis codes (25, 26]. 


2!The convergence time may vary significantly depending on the Gibbs field at hand. 
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FIGURE 6.1. The four-element cross-neighborhood and the tessellation of the index set S into two disjoint 
sublattices ὅς and So. 


FIGURE 6.2. All three possible cliques for the cross-neighborhood. 


A notable alternative to the Gibbs sampler and the thermodynamic integration for computing the rate-distortion bound 
is the Wang-Landau algorithm [87] that estimates the so-called density of stego images (density of states in statistical 
physics), g(D), defined as the number of stego images y with distortion (energy) D. The partition function (and thus, 
via (6.11), the entropy) and the expected distortion can be computed from g(D) by numerical integration: 


(6.41) Z(r) = Σ᾽ g(D) exp(—AD)A, 
DED 
1 
6.42) Ε,. [Ὁ] = -ἶ- δ᾽ Dg(D)exp(—AD)A, 
( [Ρ] Zo) 2 g(D) exp 


where Ὁ = {d),...,dn,}, d) = —K, dnp = K, αἰ —dj_, = A is a set of discrete values into which the dynamic range of 
D, |—K, K] is quantized. 

The PI notes that in general it is not possible to determine ahead of time which method will provide satisfactory 
performance. In experiments described in Section 6.7, the thermodynamic integration worked very well and provided 
results identical to the much more complex Wang—Landau algorithm. 

Note that computing the rate-distortion bound is not necessary for practical embedding. In Section 6.5, one introduces 
a special form of the distortion in terms of a sum over local potentials. In this case, both types of optimal senders 
can be simulated using algorithms that do not need to compute A in the fashion described above. This is explained in 
Sections 6.5.1 and 6.5.2. 


6.5. Local distortion function. Thanks to the Gibbs sampler, one can simulate the impact of embedding that is 
optimal in the sense of (6.4) and (6.6) without having to construct a specific steganographic scheme. This is important 
for steganography design as one can test the effect of various design choices and parameters and then implement only the 
most promising constructs. However, it is rather difficult to design near-optimal schemes for a general D(y). Fortunately, 
it is possible to give the distortion function a specific form that will allow constructing practical embedding algorithms. 
It will be assumed that D is a sum of local potentials defined on small groups of pixels called cliques. This local form of 
the distortion will be still quite general to capture dependencies among embedding changes and it allows construction of 
a large spectrum of diverse embedding schemes -- a topic left for Section 6.6. 

First, the PI defines a neighborhood system as a collection of subsets of the index set {n(i) C S|t = 1,...,n} satisfying 
i ¢ η(1), Vi and 7 € η(7) if and only if 7 € n(t). The elements of 7(¢) are called neighbors of pixel i. A subset c C S isa 
clique if each pair of different elements from c are neighbors. The set of all cliques will be denoted C. The PI does not use 
the calligraphic font for a clique even though it is a set (and thus deviate here from the established convention) to comply 
with a well established notation used in previous art. 

In this section and in Section 6.6, it will be necessary to address pixels by their two-dimensional coordinates. The PI 
will hus be switching between using the index set S = {1,...,n} and its two-dimensional equivalent S = {(i,7)|1 <i< 
nmi,1 <j < no} hoping that it will cause no confusion for the reader. 


Example 13. The four-element cross neighborhood of pixel z;,; consisting of {2;~1,5, Zi41,5,2i,j-1,2i,j41} with a proper 
treatment at the boundary forms a neighborhood system (see Figure 6.1). The cliques contain either a single pixel (one- 
element) cliques {z;,;} or two horizontally or vertically neighboring pixels, {z;,;, 2,541}, {2ij,Zi+1,j} (Figure 6.2). No 
other cliques exist. 
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FIGURE 6.3. The eight-element neighborhood and the tessellation of the index set S into four disjoint 
sublattices marked with four different symbols. 


| i 


NN ΓΙ 
FIGURE 6.4. All possible cliques for the eight-element neighborhood. 


Example 14. The eight-element 3 x 3 neighborhood also forms a neighborhood system (Figure 6.3). The cliques are 
as in Example 13 as well as all cliques containing pairs of diagonally neighboring pixels, {7;,;, 2141541}, {2ij,Zi-1,j41}, 
three-pixel cliques forming a right-angle triangle (e.g., {2;,;, 2i,;+1,Zi+1,;}), and four-pixel cliques forming a 2 x 2 square 
({2i5,2i,j41)Zi+1,j3,Li+1,j3+1}) (follow Figure 6.4). No other cliques exist for this neighborhood system. 


Each neighborhood system allows tessellation of the index set S into disjoint subsets (sublattices) whose union is the 
entire set S, so that any two pixels in each lattice are not neighbors. For example, for the cross-neighborhood S = S& USo, 
where 


(6.43) Se = {(i,j)|i +7 is-even}, So = {(i, j)|i +7 is odd}. 

For the eight-element 3 x 3 neighborhood, there are four sublattices, S = ,,Sa,, 1 < a,b < 2, whose structure 
resembles the Bayer color filter array commonly used in digital cameras [29], 
(6.44) Sab = {(a + 2k, b+ 21}}1 <a+2k 4πι.,1 « δ- 2ὶ < no}. 


For a clique c € ὦ, one denotes by V-(y) the local potential, which is an arbitrary bounded function that depends only 
on the values of y in the clique c, V.(y) = V-(y-). The PI reminds that V. may also depend on x in an arbitrary fashion. 
It is time now to introduce a local form of the distortion function as 


(6.45) D(y) = δ Velyc). 
cEC 


The important fact is that D is a sum of functions with a small support. Let us express the local characteristics (6.24) in 
terms of this newly-defined form (6.45): 


exp(—A Σεες Ve(viy~i)) 
Dot ez, EXP(—A Docec Ve(tiy~i)) 
(a) — €XP(—A Di cecyiy Ve(viy~:)) 
- ener, exp(—A Σ»εεσ(ὴ Ve(tiyni))’ 


where C(i) = {c Ε Cli Ec}, i = 1,...,n. Equality (a) holds because (ἐν...) does not depend on ἐς for cliques c ¢ C(t) 
as they do not contain the ith element. Thus, the terms Κα for such cliques cancel from (6.47). This has a profound 
impact on the local characteristics, making the realization of Y; independent of changes made outside of the union of 
cliques containing pixel 7 and thus outside of the neighborhood 7(1). For the cross-neighborhood system from Example 13, 
changes made to pixels belonging to the sublattice S, do not interact and thus the Gibbs sampler can be parallelized by 
first updating all pixels from this sublattice in parallel and then updating in parallel all pixels from S,.?? 


(6.46) Tr(Y: = yly~i) = 


(6.47) 


22The Gibbs random field described by the joint distribution 2)(y) with distortion (6.45) becomes a Markov random field on the same 
neighborhood system. This follows from the Hammersley-Clifford theorem [92]. 
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Algorithm 2 One sweep of a Gibbs sampler for embedding m-bit message (payload-limited sender). 


Require: S = S; U...US, {mutually disjoint sublattices } 
1: for k =1 tos do 
2: for every i € S&, do 
3 Use (6.48) to calculate cost of changing y; > y; € Z 
4: end for 
5: Embed m/s bits while minimizing δ᾽. εις, pi(yjy~i). 
6: Update ys, with new values and keep y.s, unchanged. 
7: end for 
8: return y 


The possibility to update all pixels in each sublattice all at once provides a recipe for constructing practical embedding 
schemes. Assume S = S,U...US, with mutually disjoint sublattices. The PI first describes the actions of a payload-limited 
sender (follow the pseudo-code in Algorithm 2). 


6.5.1. Payload-limited sender. The sender divides the payload of m bits into s equal parts of m/s bits, computes the local 
distortions 


(6.48) pilyiyn:) = δ΄ Velyiy~s) 
c€C(t) 


for pixels 1 € ὅδ), amd embeds the first message part in S}. Then, it updates the local distortions of all pixels from S2 
and embeds the second part in S2, updates the local distortions again, embeds the next part in S3, etc. Because the 
embedding changes in each sublattice do not interact, the embedding can be realized as discussed in Section 6.3. After all 
sublattices are processed it will mean that one embedding sweep was completed. By repeating these embedding sweeps,”° 
the resulting modified images will converge to a sample from 7). 

The embedding in sublattice S, will introduce embedding changes with probabilities (6.15), where the value of A, is 
determined by the individual distortions {p;(yjy~;)|t € S;} (6.48) to satisfy the payload constraint of embedding m/s bits 
in the kth sublattice (again, e.g., using a binary search for A,). Because each sublattice extends over a different portion of 
the cover image while splitting the payload evenly across the sublattices, A; may slightly vary with k because of variations 
in the individual distortions. This represents a deviation from the Gibbs sampler. Fortunately, the sublattices can often 
be chosen so that the image does not differ too much on every sublattice, which will guarantee that the sets of individual 
distortions {p;(y:y~;)|t € S,} are also similar across the sublattices. Thus, with an increased number of sweeps, A, will 
converge to an approximately common value and the whole process represents a correct version of the Gibbs sampler. 

In binary embedding (Z; = (19) 2.}}}, note that the two distortions (2 yn) = D(z! Yn(i)) oh) (ay) = 
D(z) Yn(i)) at pixel t depend on the current pixel values in its neighborhood 7(%). Therefore, both ps” and ρ(" can 


be non-zero at the same time and one can even have pi?) < p° 


whether or not it is beneficial to preserve the value of the pixel! 


: 
. It is the neighborhood of i that ultimately determines 


6.5.2. Distortion-limited sender. A similar approach can be used to implement the distortion-limited sender with a dis- 
tortion limit D,. Consider a simulation of such embedding by a Gibbs sampler with the correct A (obtained from a binary 
search as described in Section 6.4.2) on the sublattice δὰ C S. Assuming again that all sublattices have the same distortion 
properties, the distortion obtained from cliques containing pixels from S, should be proportional to the number of such 
cliques. Formally, 


{cE CicNS, FO}! 
: Ici . 


As described in Algorithm 3, the sender can realize this by embedding as many bits to every sublattice as possible while 
achieving the distortion (6.49). Note that one does not need to compute the partition function for every image in order to 
realize the embedding. Moreover, in practice when the embedding is implemented using syndrome-trellis codes [26], the 
search for the correct parameter A, as described in Section 6.4.2, is not needed either as long as the distortion properties 
of every sublattice are the same. This is because the codes need the local distortion p;(y{y~;) (6.48) at each lattice pixel 
1 and not the embedding probabilities. (This eliminates the need for 4.) 


(6.49) Ex,(¥s,\¥~s,)\D] = Ὁ 


23 4 fter each embedding sweep, at each pixel the previous change is erased and the pixel is reconsidered again, just like in the Gibbs sampler. 
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Algorithm 3 One sweep of a Gibbs sampler for a distortion-limit sender, Ἐπ, [}] = D,. 


Require: S = δι)... ὦ δ. {mutually disjoint sublattices} 
1: for k = 1 to 8 do 

for every i € S; do 

Use (6.48) to calculate cost of changing y; > νἱ € 7, 
end for 
Embed m, bits while >; pi(yjy~i) = D. x |{c Ε Clo Sp τ O}I/IC\. 
Update ys, with new values and keep ys, unchanged. 
7: end for 
8: return y and δ᾽. m, {stego image and number of bits} 


The issue of the minimal sufficient number of embedding sweeps for both algorithms needs to be studied specifically for 

each distortion measure (see the discussion in the experimental Section 6.7). By replacing a specific practical embedding 
method with a simulator of optimal embedding, one can simulate the impact of optimal algorithms (for both senders) 
without having to determine the value of the parameter ἃ as described in Section 6.4.2. It is still necessary to compute 
Ax for each sublattice S; to obtain the probabilities of modifying each pixel (6.15), but this can be done as described in 
Section 6.3 without having to use the Gibbs sampler or the thermodynamic integration. 
_ Finally, the PI comments on how to handle wet pixels within this framework. Since it is assumed that the distortion 
is bounded (|D{y)| < K for all y € Y), wet pixels are handled by forcing Z; = {z;}. Because this knowledge may not be 
available to the decoder in practice, practical coding schemes should treat them either by setting p;(y;) = oo or to some 
large constant for y; # x; (for details, see [26]). 


6.5.3. Practical limits of the Gibbs sampler. Thanks to the bounds established in Section 6.1, it is now known that the 
maximal payload that can be embedded in this manner is the entropy of 7) (6.11). Assuming the embedding proceeds on 
the bound for the individual sublattices, the question is how close the total payload embedded in the image is to H(7)). 
Following the Gibbs sampler, the configuration of the stego image will converge to a sample y from 7). Let us now go 
through one more sweep with y*! denoting the stego image before starting embedding in sublattice S,, k = 1,...,s. In 
each sublattice, the following payload is embedded: 


(6.50) H(Ys, [YW _ γῆ). 
The following result from information theory is now used: For any random variables X,,..., Xs, 
§ 
(6.51) Σ 5 A(Xe|Xan) S A(X... Xs), 
k=1 


with equality only when all variables are independent.?4 Thus, in general 


(6.52) H~(¥)  Σ H(¥s,|¥as, - γι.) < Η(Χ) = H(m). 
k=1 


The term H~(Y) is recognized as the erasure entropy [84, 85] and it is equal to the conditional entropy H(Y“+)/y) 
(entropy rate) of the Markov process defined by our Gibbs sampler (c.f., (6.35)), where Y is the random variable obtained 
after | sweeps of the Gibbs sampler. 

The erasure-entropy inequality (6.52) means that the embedding scheme will be suboptimal, unable to embed the 
maximal payload H(z). The actual loss can be assessed by evaluating the entropy of H(7)), e.g., using the algorithms 
described in Section 6.4. An example of such comparison is presented in Section 6.7.3. 

The last remaining issue is the choice of the potentials V.. In the next section, the PI shows one example where V, are 
chosen to tie the principle of minimal embedding distortion to the preservation of the cover-source model. The PI also 
describes a specific embedding method and subjects it to experiments using blind steganalyzers. 


24For k = 2, this result follows immediately from H(X |X2) + H(X2|X1) = H(X1,.X2) — 1(X1;X2). The result for s > 2 can be obtained 
by induction over s. 
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6.6. Practical embedding constructions. Here, a practical embedding method that uses the theory developed so far 
is described. First and foremost, the potentials V. should measure the detectability of embedding changes. There is 
substantial freedom in choosing them and the design may utilize reasoning based on theoretical cover source models as 
well as heuristics stemming from experiments using blind steganalyzers. The proper design of potentials is a complicated 
subject in itself and is beyond the scope of this effort, whose main purpose was to introduce a general framework rather 
than optimizing the design. Here, the PI describes a specific example of a more general approach that builds upon the 
latest results in steganography and steganalysis and one that provided an opportunity to validate the proposed framework 
by showing an improvement over the current.state of the art in Section 6.7. 


6.6.1. Additive approrimation. As argued in the introduction, the steganography design principles based on model preser- 
vation and on minimizing distortion coincide when the distortion is defined as a norm of the difference of feature vectors 
used to model cover images: 


d 
(6.53) D(y) = ||f(x) -- F(y)I| © Σ wel fel) — ΚΟ}. 
k=1 
Here, f(x) = (f:(x),..., fa(x)) € R® is a d-dimensional feature vector of image x and w = (w),...,wa) are weights. 


The properties of D Alina) in this manner depend on the properties of the functions ἔκ. In general, however, D is not 
additive. In the past, steganographers were forced to use some additive approzimation of D to realize the embedding in 
practice. A general method for turning an arbitrary distortion measure into an additive proceeds is: 


(6.54) D(y) = Σ Diux~) 


Embedding with the additive measure D can be simulated (and realized) as explained in Section 6.3. The approximation, 
of course, ensues a capacity loss due to a mismatch in the minimized distortion function. Thanks to the methods introduced 
in Section 6.4.2, this loss can now be contrasted against the rate-distortion bound for the original measure D. However, 
one cannot build a practical scheme unless D can be written as a sum of local potentials. Next, the PI explains how to 
turn D into this form using the idea of a bounding distortion. 


6.6.2. Bounding distortion. Most features need in steganalysis can be written as a sum of locally-supported functions 
across the image 


(6.55) fix) = Sf"), R=1,....4 
cEC 
For example, the kth histogram bin of image x can be written using the Iverson bracket as 
(6.56) hy (x) = Ὁ ἴμκι = fl, 
τε 5 
while the klth element of a horizontal co-occurrence matrix 


m; ng-—l 


(6.57) Cx,i(x) = S> δ᾽ [as 5 = k\(2i541 ce ἰ] 


ἐπεὶ j=1 


is a sum over horizontally adjacent pixels (horizontal two-pixel cliques). For such locally-supported features, one can 
obtain an upper bound on D(y) = ||f(x) — f(y)||, y€ that has the required form: 


(6.58) ite) — Fy) = Sol Tf) - Ey) 
k=] cEc 
(6.59) < τ we δ |f (x) - ty) 
k=] cEC 
(6.60) =), 3 wel fE(x) — ££" (y)| 


cE€C k=) 


(6.61) =) V.(y), 
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where 


d 
(6.62) Vey) = >> wel 3 ) -- ΚΘ). 


k=} 


Following our convention explained in Section 6.1, the PI describes the methodology for a fixed cover image x and thus 
does not make the dependence of V, on x explicit. The sum δ ες Ve(y) will be called the bounding distortion. 

The PI now provides a specific example of this approach. The choice is motivated by the desire to work with a modern, 
well-established feature set so that later, in Section 6.7, one can validate the usefulness of the proposed framework 
by constructing a high-capacity steganographic method undetectable using current state-of-the-art steganalyzer. The 
motivation and justification of the feature set appears in [66]. It is a slight modification of the SPAM set [64], which is the 
basis of the current most reliable blind steganalyzer in the spatial domain. The features are constructed by considering 
the differences between neighboring pixels (e.g., horizontally adjacent pixels) as a higher-order Markov chain and taking 
the sample joint probability matrix (co-occurrence matrix) as the feature. The advantage of using the joint matrix instead 
of the transition probability matrix is that the norm of the feature difference can be readily upper-bounded by the desired 
local form (6.62). 

To formally define the feature for an πὶ x n2 image x, let us consider the following co-occurrence matrix computed 


from horizontal pixel differences D;* 3 (%) 5: δι 541 — Zig Ξε ἃ... π|ὶ,75]...., --ΙἸ: 
τι Ἴ-- 2 
(6.63) AniG) = > Σ, 2 (Pigs Pier) = (0) 


i=l j= 


For a in (6.63) the argument of the Iverson bracket is abbreviated from Dj¥(x) = k & Dz4,,(x) = ἰ to 
(Dz, Di341)(x) = (k, 1). Clearly, Az%(x) is the normalized count of neighboring triples of πων {πιο Vi j41,2ig+2} with 
differences 2;,;41 — 24 = k and 2x;,;42 — 24,341 =! in the entire image. The superscript arrow “+” denotes the fact that 
the differences are computed by subtracting the left pixel from the right one. Similarly, 


ny ng 


n et — 2) Σὺ Σ MDG PH) = (κ,})} 


t=1 7=3 


(6.64) Ax (x) = 


ony Di; (x) = ij 1 — z,;- By analogy, one can define vertical, diagonal, and minor diagonal matrices Ay be Ak fe 
Ag. 4 Ag by ἊΝ hy ΑΝ ,. All eight matrices are sample joint probabilities οὗ observing the differences k and | between three 


consecutive pixels along a certain direction. Due to the antisymmetry D;* f(x) = ij 1 54+1(x) only A,”, Ag ar Al κι" ἈΝ κὶ are 
needed since A,”, = Af, _,, and similarly for other matrices. 

Because neighboring pixels in natural images are strongly dependent, each matrix exhibits a sharp peak around (k,1) = 
(0,0) and then quickly falls off with increasing k and /. When such matrices are used for steganalysis [64], they are 
truncated to a small range, such as -T < k,l < T, T = 4, to prevent the onset of the “curse of dimensionality.” On the 
other hand, in steganography one can use large-dimensional models (T = 255) because it is easier to preserve a model 
than to learn it.2> Another reason for using a high-dimensional feature space is to avoid “overtraining” the embedding 
algorithm to a low-dimensional model as such algorithms may become detectable by a slightly modified feature set, an 
effect already reported in the DCT domain [56]. 


25Similar reasoning for constructing the distortion function was used in the HUGO algorithm [66]. 
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FIGURE 6.5. The union of all 12 cliques consisting of three pixels arranged in a straight line in the 5 x 5 
square neighborhood. 


By embedding a message, Aj’;(x) is modified to A;’,(y). The differences between the features will thus serve as a 
measure of embedding impact eee tied to the er (the indices i and 7 run from 1 to n, and 1,2 — 2, respectively): 


(6.65) |Aga(y) — Agu (x) = 


(6.66) πα Bh pl Pi Disa) = ἀρ) 
(6.67) προ (0) 
(6.68) 7 aoe aes _ 2) oN Dij+1)(y) = (k, ἰὴ] 
(6.69) τ᾿ ((Dz3,D i g+1( = (k, 1) 
(6.70) ipa pi 
cec~ 
where the PI defined the following locally-supported functions 
τὴν ὦ 

Hy’ (y) = σης Ξ 

(6.71) (Deh Di4a(y) = (FD) -- (DF, Dzj41)09 = (KDI 


on all horizontal cliques C* = {εἰς = {(2,7), (1,7 + 1), (6,2 + 2)}}. Notice that the absolute value had to be pulled into 
the sum to give the potentials a small support. 

Since the other three matrices can be written in this manner as well, one can write the distortion function in the 
following final form 


(6.72) D(y) = δ Vey), 
᾿ cEC 
now with C =C7* UC” UCT UCN, the set of three-pixel cliques along all four directions, and 
(6.73) ‘ary a wp tH (y), for each clique c € C™, 
κα 


and similarly for the other three clique types. Notice that there are again weights w;.; > 0 in the definition of V- that can 
be adjusted according to how sensitive steganalysis is to the individual differences. For example, if a certain difference 
pair (k,l) varies significantly over cover images, by assigning it a smaller weight one can allow it to be modified more 
often, while those differences that are stable across covers but sensitive to embedding should be intuitively assigned a 
larger value so that the embedding does not modify them too much. 

To complete the picture, the neighborhood system here is formed by 5 x 5 neighborhoods and thus the index set can 
be decomposed into nine disjoint sublattices S = U,, Sas, 1 < a,b < 3, 


(6.74) Sap = {(a + 3k, 64 3l)|1 Cat 3k 4πι,1 “Ὁ 8ὶ < no}. 


To better explain the effect of embedding changes on the distortion, realize that each pixel belongs to three horizontal, 
three vertical, three diagonal, and three minor-diagonal cliques. When a single pixel z;,; is changed, it affects only the 12 
potentials whose clique contains z;,;. For example if the original pixel values co = {2;,;,2i,j41, 2i,j+2} had differences k,l, 
and the pixel value changed from 7; ; to y;,; = ζει.) +1. Then, the pixel differences will be modified to k —1,/. Considering 
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Average error Pr 
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Relative payload a (bpp) 


FIGURE 6.6. Comparison of +1 embedding with optimal binary and ternary coding with binary embed- 
ding algorithms based on the Gibbs construction with a bounding distortion and the additive approxima- 
tion as described in Section 6.7.1. The error bars depict the minimum and maximum steganalyzer error 
Pe over five runs of SVM classifiers with different division of images into training and testing set. 


just the contribution from Hs > to the potential V., (6.73), it will increase by the sum of w,,, (the pair k,! is leaving 
cover) and w,—;, (a new pair appears in the stego image). 


6.6.3. Other options. The framework presented in this report allows the sender to formulate the local potentials directly 
instead of obtaining them as the bounding distortion. For example, the cliques and their potentials may be determined 
by the local image content or by learning the cover source using the method of fields of experts [70]. The merit of these 
possibilities can be evaluated by steganalyzers trained on a large set of images. The important question of optimizing the 
local potential functions w.r.t. statistical detectability is an important direction the PI intends to explore in the future. 


6.7. Experiments. In this section, the PI validates the proposed framework experimentally and includes a comparison 
between simple steganographic algorithms, such as binary and ternary +1 embedding and steganography implemented 
via the bounding distortion and the additive approximation (6.54). For the case of the bounding distortion, the capacity 
loss w.r.t. the optimal payload given by H(7)) is evaluated by means of the thermodynamic integration algorithm from 
Section 6.4.2. 


6.7.1. Tested embedding methods. For the methods based on additive approximation and the bounding distortion, the PI 
used as a feature vector the joint probability matrix Aj’, ,,(x) defined similarly as in (6.63) with the difference vector 
computed from four consecutive pixels (Dj, D741, Dj75}42) = (k,l,m). As above, four such matrices corresponding to 
four spatial directions were computed. The matrices were used at their full size Τ᾽ = 255 leading to model dimensionality 
d =4x 511° = 56.108. 

The weights were chosen to be small for those triples (Dz, D341, Dij42) = (k, l,m) that occur infrequently in images 
and large for frequented triples. Following the recommendation described in [66], since the frequency of occurrence of the 
triples falls off quickly with their norm, the weights are chosen as 


-9 
(6.75) ΕΝ - (σ εν εν), 


with 9= 1 and o = 1. The purpose of the weights is to force the embedding algorithm to modify those parts of the model 
that are difficult to model accurately, forcing thus the steganalyst to use a more accurate model. Here, the advantage goes 
to the steganographer, because preserving a high-dimensional feature vector is more feasible than accurately modeling it. 
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Because the neighborhood 7(2) in this case contains 7 x 7 pixels, the image was divided into 16 square sublattices on 
which embedding was carried out independently. The PI tested binary embedding, Z; = {z;,z;}, where x; was selected 
randomly and uniformly from {xz;—1,2;+ 1} and then fixed for all experiments with cover x. The payload-limited sender 
was simulated using the Gibbs sampler constrained to only two sweeps. Increasing the number of sweeps did not lead to 
further improvement. The curiously low number of sweeps sufficient to properly implement the Gibbs sampler is most 
likely due to the fact that the dependencies dictated by the bounding distortion are rather weak. The simulation of 
embedding for one image took less than 5 seconds when implemented in C++.o0n a single-processor PC. 

To summarize, the following four steganographic methods were tested: 

(1) Binary embedding using the Gibbs construction with sets Z; = {z,;,z} and bounding distortion (6. 72) of (6.53) 
with weights (6.75) for the d = 4 x 511°-dimensional feature space given by matrices Aj’, κι: Ag lm? Al ΜΝ ἌΝ me 

(2) Additive approximation (6.54) of (6.53) for the same sets Z;, feature space, and norm as in 1). 

(3) Binary +1 embedding with the same sets 7, equipped with a matrix embedding scheme operating on the binary 
bound. 

(4) Ternary +1 embedding with 7; = {z; -- 1, σε, πὶ +1} equipped with a ternary matrix embedding scheme operating 
on the ternary bound (the bounds appear, e.g., in [29]). 


Practical near-optimal codes for the two +1 embedding methods can be found in [31] and [98]. 


6.7.2. Testing methodology and final results. Following the separation principle, the PI now studies the security of all 
schemes when operating on the rate-distortion bound. All tests were carried out on the BOWS2 database [2] containing 
approximately 10800 grayscale images with a fixed size of 512 x 512 pixels coming from rescaled and cropped natural 
images of various sizes. Steganalysis was implemented using the second-order SPAM feature set with T = 3 [64]. The 
image database was evenly divided into a training and a testing set of cover and stego images, respectively. A soft- 
margin support-vector machine was trained using the Gaussian kernel. The kernel width and the penalty parameter were 
determined using five-fold cross validation on the grid (C,y) € {(10*,2°)|k Ε {—3,...,4},7 € {-L—-3,...,-L+3}}, 
where L = log, d is the binary logarithm of the number of features. 

The results are reported using the minimum average classification error Pg. Smaller values of Pe correspond to better 
steganalysis and thus larger statistical detectability (lower security). 

Figure 6.6 displays the comparison of all four embedding methods listed above. The methods based on the the 
bounding distortion and the additive approximation (denoted as “Bounding dist.” and “Additive approx.”) are completely 
undetectable for payloads smaller than 0.15 bpp, which suggests that the embedding changes are made in pixels not 
covered by the SPAM features. Since both schemes are binary with Z; = {2,,2{} with xz; randomly chosen from {z; — 
1,z; + 1}, they become equivalent to simple binary +1 embedding (Method 3) as a — 1 and thus become detectable. 
Comparing the capacity, both schemes allow communicating ten times larger payloads with Pe = 40% as compared 
to ternary +1 embedding. The advantage of using the Gibbs sampler with the bounding distortion over the additive 
approximation becomes more evident for larger payloads, where the embedding changes start to interact. This confirms 
the expectation that in this range the additive approximation is unable to cope with the interactions among changes and 
thus its detectability increases. This result, however, may change for different distortion measures and cover sources. 
The fact that the Gibbs sampler with bounding distortion did not bring a substantial performance improvement over the 
additive approximation indicates that the interactions among embedding changes are in general quite weak (at least as 
far as they are captured by the bounding distortion). The low strength of interactions also explains why only two sweeps 
of the Gibbs sampler were sufficient in practice. 


6.7.3. Analysis of upper bounds. As described in Section 6.5.3, Algorithm 2 for the payload-limited sender is unable to 
embed the optimal payload of H(7)) for three reasons. The performance may be affected bythe small number of sweeps 
of the Gibbs sampler, the parameter \ may vary slightly among the sublattices, and the algorithm embeds the erasure 
entropy H~(m,) < H(m)). The combined effect of these factors is of great importance for practitioners and is evaluated 
below for two images using the Gibbs sampler and the thermodynamic integration as explained in Section 6.4.2. 

Since the Gibbs construction depends on the cover image x, the PI present the results for two grayscale images of 
size 512 x 512 pixels coming from two different sources. The test image “O.png” is from the BOWS2 database and 
“Lenna” was obtained from http://en.wikipedia.org/wiki/File:Lenna.png and converted to grayscale using GNU 
Image Manipulation Program (GIMP). In both cases, the PI used the same sets Z; and the same feature set as in the 
previous section with the bounding distortion with weight parameters o = 1 and 6 = 1. 

The image “O.png” contains more areas with edges and textures than “Lenna” and thus for small distortions, it offers 
a larger capacity than “Lenna” because the weights (6.75) around edgés and complex texture are small. This is apparent 
from the slopes of the rate-distortion bounds in Figure 6.7. 
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FIGURE 6.7. Comparison of the payload loss of Algorithm 2 for cover images “O.png” and “Lenna” 
shown on the right. The rate-distortion bounds were obtained using the Gibbs sampler (6.38) and the 
thermodynamic integration (6.40). 


The same figure compares the rate-distortion performance of the payload-limited sender simulated by the Gibbs sampler 
with only two sweeps as described in Algorithm 2. For a given payload, the distortion was obtained as an average over 
100 random messages. The comparison shows that the payload loss of Algorithm 2 to the optimal Η (πλ) is quite small. 
Note that the erasure entropy, H~ (7), plotted in the figure has been computed over the sublattices after two sweeps and 
thus already contains the impact of all three factors discussed at the beginning of this section. 


6.8. Discussion. Currently, the most successful principle for designing practical steganographic systems that embed in 
empirical covers is based on minimizing a suitably defined distortion measure. Implementation difficulties and a lack of 
practical embedding methods have so far limited the application of this principle to a rather special class of distortion 
measures that are additive over pixels. With the development of near-optimal low-complexity coding schemes, such as 
the syndrome-trellis codes [26], this direction has essentially reached its limits. It is a firm belief of the PI that further 
substantial increase in secure payload is possible only when the sender uses adaptive schemes that place embedding changes 
based on the local content, that dare to modify pixels in some regions by more than 1, and that consider interactions 
among embedding changes while preserving higher-order statistics among pixels. This section describes a contribution 
which is an important step in this direction. 

It offers the steganographer a complete methodology for embedding while minimizing an arbitrarily defined distortion 
measure D. The absence of any restrictions on D means that the remaining task left to the sender is to find a distortion 
measure that correlates with statistical detectability. An appealing possibility is to define D as a weighted norm of 
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the difference between cover and stego feature vectors used in steganalysis. This immediately connects the principle of 
minimum-distortion steganography with the concept of model preservation which has so far been limited to low-dimensional 
models. Being able to preserve a large-dimensional model gives the steganographer a great advantage over the steganalyst 
because of the difficulties associated with learning a high-dimensional cover source model using statistical learning tools. 

The proposed framework is called the Gibbs construction and it connects steganography with statistical physics, which 
contributed with many practical algorithms. In particular, the Gibbs sampler combined with the thermodynamic integra- 
tion can be used to derive the rate-distortion bound, simulate the impact of optimal embedding, and realize near-optimal 
embedding algorithms. These three tasks can be addressed separately (the so-called “separation principle”) giving the 
sender a great amount of design flexibility as well as control over losses of practical schemes. 

An important case elaborated in this section corresponds to D defined as a sum of local potentials over small pixel 
neighborhoods. Here, the optimal distribution of embedding modifications reduces to a Markov random field and the Gibbs 
sampler can be turned into a practical embedding algorithm able to consider dependencies among embedding changes. 
When D cannot be written as a sum of local potentials, practical (suboptimal) methods can be realized by approximating 
D either with an additive distortion measure or with local potentials. The problem of finding the best approximation for 
a given non-local D is of its own interest. The PI did not cover the task of minimizing the statistical detectability with 
respect to the distortion function completely due to its inherent complexity; it is left as part of future effort. 

The proposed methodology was described both for a payload-limited sender and the distortion-limited sender. The 
former embeds a fixed payload in every image with minima] distortion, while the latter embeds the maximal payload 
for a given distortion in every image. The distortion-limited sender better corresponds to the intuition that, for a fixed 
statistical detectability, more textured or noisy images can carry a larger secure payload than smoother or simpler images. 
The fact that the size of the hidden message is driven by the cover image essentially represents a more realistic case of 
the batch steganography paradigm [50]. 

Note that the distortion measure is used only by the sender and thus does not need to be shared. The only information 
needed by the receiver to decode the message is its size which can be communicated separately in the same cover image. 
This opens up the intriguing possibility to develop embedding schemes able to learn the proper distortion function while 
observing the impact of embedding on the cover source. 

Finally, the proposed methodology can be applied to other data hiding problems where the statistical detectability 
constraint could be replaced by a perceptual distortion constraint. The implicit premise of this work is the direct rela- 
tionship between the distortion function D and statistical detectability. Designing (and possibly learning) the distortion 
measure for a given cover source is an interesting problem by itself and was addressed by the PI in her last publication [24] 
developed under this effort and made possible by the No-Cost Extension. C++ implementation with Matlab wrappers of 
STCs and multi-layered STCs are available at http: //dde. binghamton. edu/download/syndrome/. 
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